Multi-resolution switched audio encoding/decoding scheme

ABSTRACT

An audio encoder for encoding an audio signal has a first coding branch, the first coding branch comprising a first converter for converting a signal from a time domain into a frequency domain. Furthermore, the audio encoder has a second coding branch comprising a second time/frequency converter. Additionally, a signal analyzer for analyzing the audio signal is provided. The signal analyzer, on the one hand, determines whether an audio portion is effective in the encoder output signal as a first encoded signal from the first encoding branch or as a second encoded signal from the second encoding branch. On the other hand, the signal analyzer determines a time/frequency resolution to be applied by the converters when generating the encoded signals. An output interface includes, in addition to the first encoded signal and the second encoded signal, resolution information identifying the resolution used by the first time/frequency converter and used by the second time/frequency converter.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of copending U.S. application Ser. No. 13/081,223, filed Apr. 6, 2011, which is incorporated herein by reference in its entirety, which is a continuation of copending International Application No. PCT/EP2009/007205, filed Oct. 7, 2009, which is incorporated herein by reference in its entirety, and additionally claims priority from European Applications Nos. 09002271.6, filed Feb. 18, 2009, and EP 08017663.9, filed Oct. 8, 2008 and U.S. patent application Ser. No. 61/103,825, filed Oct. 8, 2008, which are all incorporated herein by reference in their entirety.

BACKGROUND OF THE INVENTION

The present invention is related to audio coding and, particularly, to low bit rate audio coding schemes.

In the art, frequency domain coding schemes such as MP3 or AAC are known. These frequency-domain encoders are based on a time-domain/frequency-domain conversion, a subsequent quantization stage, in which the quantization error is controlled using information from a perceptual module, and an encoding stage, in which the quantized spectral coefficients and corresponding side information are entropy-encoded using code tables.

On the other hand, there are encoders that are very well suited to speech processing such as the AMR-WB+ as described in 3GPP TS 26.290. Such speech coding schemes perform a Linear Predictive filtering of a time-domain signal. Such LP filtering is derived from a Linear Prediction analysis of the input time-domain signal. The resulting LP filter coefficients are then quantized/coded and transmitted as side information. The process is known as Linear Prediction Coding (LPC). At the output of the filter, the prediction residual signal or prediction error signal, which is also known as the excitation signal, is encoded using the analysis-by-synthesis stages of the ACELP encoder or, alternatively, is encoded using a transform encoder, which uses a Fourier transform with an overlap. The decision between the ACELP coding and the Transform Coded eXcitation coding, which is also called TCX coding, is done using a closed loop or an open loop algorithm.
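The LP analysis and the prediction residual described above can be illustrated with a minimal sketch. It assumes the autocorrelation method with a Levinson-Durbin recursion, which is one common way to derive LP filter coefficients and is not taken from this document. The function names are hypothetical, and windowing, coefficient quantization and the ACELP/TCX stages are omitted.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of the time-domain signal x."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for predictor coefficients a[1..order] with x[n] ~ sum(a[k] * x[n-k]).

    Returns the coefficient list and the final prediction error energy.
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate input, stop early
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient of stage i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err


def prediction_residual(x, coeffs):
    """Prediction error (excitation) signal e[n] = x[n] - sum(a[k] * x[n-k])."""
    residual = []
    for n in range(len(x)):
        pred = sum(c * x[n - 1 - k]
                   for k, c in enumerate(coeffs) if n - 1 - k >= 0)
        residual.append(x[n] - pred)
    return residual


# Toy input: a decaying exponential, which a first-order predictor models well.
signal = [0.9 ** n for n in range(200)]
coeffs, err = levinson_durbin(autocorrelation(signal, 2), 2)
excitation = prediction_residual(signal, coeffs)
```

For this toy input the residual carries far less energy than the signal itself, which is why the excitation signal, rather than the waveform, is handed to the subsequent coding stage.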

Frequency-domain audio coding schemes such as the High Efficiency AAC (HE-AAC) encoding scheme, which combines an AAC coding scheme and a spectral band replication (SBR) technique, can also be combined with a joint stereo or a multi-channel coding tool which is known under the term “MPEG surround”.

On the other hand, speech encoders such as the AMR-WB+ also have a high frequency extension stage and a stereo functionality.

Frequency-domain coding schemes are advantageous in that they show a high quality at low bitrates for music signals. Problematic, however, is the quality of speech signals at low bitrates.

Speech coding schemes show a high quality for speech signals even at low bitrates, but show a poor quality for other signals at low bitrates.

SUMMARY

According to an embodiment, an audio encoder for encoding an audio signal may have a first coding branch for encoding an audio signal using a first coding algorithm to acquire a first encoded signal, the first coding branch having the first converter for converting an input signal into a spectral domain; a second coding branch for encoding an audio signal using a second coding algorithm to acquire a second encoded signal, wherein the first coding algorithm is different from the second coding algorithm, the second coding branch having a domain converter for converting an input signal from an input domain into an output domain, and a second converter for converting an input signal into a spectral domain; a switch for switching between the first coding branch and the second coding branch so that, for a portion of the audio input signal, either the first encoded signal or the second encoded signal is in an encoder output signal; a signal analyzer for analyzing the portion of the audio signal to determine whether the portion of the audio signal is represented as the first encoded signal or the second encoded signal in the encoder output signal, wherein the signal analyzer is furthermore configured for variably determining a respective time/frequency resolution of the first converter and the second converter, when the first encoded signal or the second encoded signal representing the portion of the audio signal is generated; and an output interface for generating an encoder output signal having the first encoded signal and the second encoded signal and an information indicating the first encoded signal and the second encoded signal, and an information indicating the time/frequency resolution applied for encoding the first encoded signal and for encoding the second encoded signal.

According to another embodiment, a method of audio encoding an audio signal may have the steps of encoding, in a first coding branch, an audio signal using a first coding algorithm to acquire a first encoded signal, the first coding branch having the first converter for converting an input signal into a spectral domain; encoding, in a second coding branch, an audio signal using a second coding algorithm to acquire a second encoded signal, wherein the first coding algorithm is different from the second coding algorithm, the second coding branch having a domain converter for converting an input signal from an input domain into an output domain, and a second converter for converting an input signal into a spectral domain; switching between the first coding branch and the second coding branch so that, for a portion of the audio input signal, either the first encoded signal or the second encoded signal is in an encoder output signal; analyzing the portion of the audio signal to determine whether the portion of the audio signal is represented as the first encoded signal or the second encoded signal in the encoder output signal; variably determining a respective time/frequency resolution of the first converter and the second converter, when the first encoded signal or the second encoded signal representing the portion of the audio signal is generated; and generating an encoder output signal having the first encoded signal and the second encoded signal and an information indicating the first encoded signal and the second encoded signal, and an information indicating the time/frequency resolution applied for encoding the first encoded signal and for encoding the second encoded signal.

According to another embodiment, an audio decoder for decoding an encoded signal, the encoded signal having a first encoded signal, a second encoded signal, an indication indicating the first encoded signal and the second encoded signal, and a time/frequency resolution information to be used for decoding the first encoded signal and the second encoded audio signal, may have a first decoding branch for decoding the first encoded signal using a first controllable frequency/time converter, the first controllable frequency/time converter being configured for being controlled using the time/frequency resolution information for the first encoded signal to acquire a first decoded signal; a second decoding branch for decoding the second encoded signal using a second controllable frequency/time converter, the second controllable frequency/time converter being configured for being controlled using the time/frequency resolution information for the second encoded signal; a controller for controlling the first frequency/time converter and the second frequency/time converter using the time/frequency resolution information; a domain converter for generating a synthesis signal using the second decoded signal; and a combiner for combining the first decoded signal and the synthesis signal to acquire a decoded audio signal.

According to another embodiment, a method of audio decoding an encoded signal, the encoded signal having a first encoded signal, a second encoded signal, an indication indicating the first encoded signal and the second encoded signal, and a time/frequency resolution information to be used for decoding the first encoded signal and the second encoded audio signal, may have the steps of decoding, by a first decoding branch, the first encoded signal using a first controllable frequency/time converter, the first controllable frequency/time converter being configured for being controlled using the time/frequency resolution information for the first encoded signal to acquire a first decoded signal; decoding, by a second decoding branch, the second encoded signal using a second controllable frequency/time converter, the second controllable frequency/time converter being configured for being controlled using the time/frequency resolution information for the second encoded signal; controlling the first frequency/time converter and the second frequency/time converter using the time/frequency resolution information; generating, by a domain converter, a synthesis signal using the second decoded signal; and combining the first decoded signal and the synthesis signal to acquire a decoded audio signal.

According to another embodiment, an encoded audio signal may have a first encoded signal; a second encoded signal, wherein a portion of an audio signal is either represented by the first encoded signal or the second encoded signal; an indication indicating the first encoded signal and the second encoded signal; an indication of a first time/frequency resolution information to be used for decoding the first encoded signal, and an indication of a second time/frequency resolution information to be used for decoding the second encoded signal.

Another embodiment may have a computer program for performing, when running on a processor, one of the above mentioned methods.

The present invention is based on the finding that a hybrid or dual-mode switched coding/encoding scheme is advantageous in that the best coding algorithm can be selected for a certain signal characteristic. Stated differently, the present invention does not look for a signal coding algorithm which is perfectly matched to all signal characteristics. Such a scheme would be a compromise, as can be seen from the huge differences between state of the art audio encoders on the one hand, and speech encoders on the other hand. Instead, the present invention combines different coding algorithms such as a speech coding algorithm on the one hand, and an audio coding algorithm on the other hand within a switched scheme so that, for each audio signal portion, the optimally matching coding algorithm is selected. Furthermore, it is also a feature of the present invention that both coding branches comprise a time/frequency converter, but in one coding branch, a further domain converter such as an LPC processor is provided. This domain converter makes sure that the second coding branch is better suited for a certain signal characteristic than the first coding branch. However, it is also a feature of the present invention that the signal output by the domain processor is also transformed into a spectral representation.

Both converters, i.e., the first converter in the first coding branch and the second converter in the second coding branch, are configured for applying a multi-resolution transform coding, where the resolution of the corresponding converter is set dependent on the audio signal, and particularly dependent on the audio signal actually coded in the corresponding coding branch so that a good compromise between quality on the one hand, and bitrate on the other hand, or in view of a certain fixed quality, the lowest bitrate, or in view of a fixed bitrate, the highest quality is obtained.

In accordance with the present invention, the time/frequency resolution of the two converters can advantageously be set independent from each other so that each time/frequency transformer can be optimally matched to the time/frequency resolution requirements of the corresponding signal. The bit efficiency, i.e., the relation between useful bits on the one hand, and side information bits on the other hand, is higher for longer block sizes/window lengths. Therefore, it is advantageous that both converters are more biased to a longer window length, since basically the same amount of side information refers to a longer time portion of the audio signal compared to applying shorter block sizes/window lengths/transform lengths. Advantageously, the time/frequency resolution in the encoding branches can also be influenced by other encoding/decoding tools located in these branches. Advantageously, the second coding branch comprising the domain converter such as an LPC processor comprises another hybrid scheme such as an ACELP branch on the one hand, and a TCX scheme on the other hand, where the second converter is included in the TCX scheme. Advantageously, the resolution of the time/frequency converter located in the TCX branch is also influenced by the encoding decision, so that a portion of the signal in the second encoding branch is processed in the TCX branch having the second converter or in the ACELP branch not having a time/frequency converter.
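The bit-efficiency argument above can be made concrete with a small piece of arithmetic. The sketch assumes that a fixed number of side information bits is spent per block; the figures (40 bits per block, 48 kHz sampling rate, hop sizes of 1024 and 128 samples) are illustrative assumptions and are not taken from this document.

```python
def side_info_bits_per_second(side_bits_per_block, hop_samples, sample_rate):
    """Side information rate when every block carries the same number of side bits."""
    blocks_per_second = sample_rate / hop_samples
    return side_bits_per_block * blocks_per_second


# Hypothetical figures: 40 side information bits per block at 48 kHz.
long_blocks = side_info_bits_per_second(40, 1024, 48000)   # 1875.0 bit/s
short_blocks = side_info_bits_per_second(40, 128, 48000)   # 15000.0 bit/s
```

Under these assumptions the short blocks spend eight times as many bits on side information for the same stretch of audio, which is the reason for the bias towards longer windows whenever the signal permits it.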

Basically, neither the domain converter nor the second coding branch, and particularly the first processing branch in the second encoding branch and the second processing branch in the second coding branch, have to be speech-related elements such as an LPC analyzer for the domain converter, a TCX encoder for the second processing branch and an ACELP encoder for the first processing branch. Other applications are also useful when other signal characteristics of an audio signal different from speech on the one hand, and music on the other hand are evaluated. Any domain converters and encoding branch implementations can be used and the best matching algorithm can be found by an analysis-by-synthesis scheme so that, on the encoder side, for each portion of the audio signal, all encoding alternatives are conducted and the best result is selected, where the best result can be found applying a target function to the encoding results. Then, side information identifying, to a decoder, the underlying encoding algorithm for a certain portion of the encoded audio signal is attached to the encoded audio signal by an encoder output interface so that the decoder does not have to care for any decisions on the encoder side or on any signal characteristics, but simply selects its coding branch depending on the transmitted side information. Furthermore, the decoder will not only select the correct decoding branch, but will also select, based on side information encoded in the encoded signal, which time/frequency resolution is to be applied in a corresponding first decoding branch and a corresponding second decoding branch.
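The analysis-by-synthesis selection described above, in which all encoding alternatives are conducted and a target function picks the best result, can be sketched as follows. The two candidates are toy uniform quantizers standing in for the real coding branches, and the target function (squared error plus a weighted bit estimate) is an assumption for illustration; all names are hypothetical.

```python
def make_uniform_codec(step):
    """Toy encoding alternative: uniform quantization with the given step size."""
    def encode(portion):
        return [round(x / step) for x in portion]

    def decode(indices):
        return [q * step for q in indices]

    return encode, decode


def make_target_function(lam):
    """Assumed target function: squared error plus lam times a crude bit estimate."""
    def cost(portion, decoded, indices):
        distortion = sum((x - y) ** 2 for x, y in zip(portion, decoded))
        bits = sum(abs(q).bit_length() + 1 for q in indices)
        return distortion + lam * bits

    return cost


def select_encoding(portion, candidates, target_function):
    """Run every alternative, decode it again, and keep the lowest-cost result."""
    best = None
    for mode, (encode, decode) in candidates.items():
        payload = encode(portion)
        cost = target_function(portion, decode(payload), payload)
        if best is None or cost < best[0]:
            best = (cost, mode, payload)
    _, mode, payload = best
    # "mode" plays the role of the side information sent to the decoder.
    return {"mode": mode, "payload": payload}


candidates = {"coarse": make_uniform_codec(0.5), "fine": make_uniform_codec(0.1)}
portion = [0.12, -0.37, 0.81, 0.04]
choice = select_encoding(portion, candidates, make_target_function(1.0))
```

The returned mode is the only thing the decoder needs in order to pick its branch, which mirrors the statement that the decoder does not repeat any encoder-side decision.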

Thus, the present invention provides an encoding/decoding scheme which combines the advantages of all different coding algorithms and avoids the disadvantages of these coding algorithms, which come up when a signal portion has to be encoded by an algorithm that is not suited to that signal portion. Furthermore, the present invention avoids any disadvantages which would come up if the different time/frequency resolution requirements raised by different audio signal portions in different encoding branches had not been accounted for. Instead, due to the variable time/frequency resolution of the time/frequency converters in both branches, any artifacts are at least reduced or even completely avoided, which would come up in the scenario where the same time/frequency resolution would be applied for both coding branches, or in which only a fixed time/frequency resolution would be possible for any coding branches.

The second switch again decides between two processing branches, but in a domain different from the “outer” first branch domain. Again one “inner” branch is mainly motivated by a source model or by SNR calculations, and the other “inner” branch can be motivated by a sink model and/or a psycho acoustic model, i.e. by masking, or at least includes frequency/spectral domain coding aspects. Exemplarily, one “inner” branch has a frequency domain encoder/spectral converter and the other branch has an encoder coding on the other domain such as the LPC domain, wherein this encoder is for example a CELP or ACELP quantizer/scaler processing an input signal without a spectral conversion.

A further embodiment is an audio encoder comprising a first information sink oriented encoding branch such as a spectral domain encoding branch, a second information source or SNR oriented encoding branch such as an LPC-domain encoding branch, and a switch for switching between the first encoding branch and the second encoding branch, wherein the second encoding branch comprises a converter into a specific domain different from the time domain such as an LPC analysis stage generating an excitation signal, and wherein the second encoding branch furthermore comprises a specific domain such as LPC domain processing branch and a specific spectral domain such as LPC spectral domain processing branch, and an additional switch for switching between the specific domain coding branch and the specific spectral domain coding branch.

A further embodiment of the invention is an audio decoder comprising a first domain such as a spectral domain decoding branch, a second domain such as an LPC domain decoding branch for decoding a signal such as an excitation signal in the second domain, and a third domain such as an LPC-spectral decoder branch for decoding a signal such as an excitation signal in a third domain such as an LPC spectral domain, wherein the third domain is obtained by performing a frequency conversion from the second domain, wherein a first switch for the second domain signal and the third domain signal is provided, and wherein a second switch for switching between the first domain decoder and the decoder for the second domain or the third domain is provided.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the present invention are subsequently described with respect to the attached drawings, in which:

FIG. 1 a is a block diagram of an encoding scheme in accordance with a first aspect of the present invention;

FIG. 1 b is a block diagram of a decoding scheme in accordance with the first aspect of the present invention;

FIG. 1 c is a block diagram of an encoding scheme in accordance with a further aspect of the present invention;

FIG. 2 a is a block diagram of an encoding scheme in accordance with a second aspect of the present invention;

FIG. 2 b is a schematic diagram of a decoding scheme in accordance with the second aspect of the present invention;

FIG. 2 c is a block diagram of an encoding scheme in accordance with a further aspect of the present invention;

FIG. 3 a illustrates a block diagram of an encoding scheme in accordance with a further aspect of the present invention;

FIG. 3 b illustrates a block diagram of a decoding scheme in accordance with the further aspect of the present invention;

FIG. 3 c illustrates a schematic representation of the encoding apparatus/method with cascaded switches;

FIG. 3 d illustrates a schematic diagram of an apparatus or method for decoding, in which cascaded combiners are used;

FIG. 3 e illustrates a time domain signal and a corresponding representation of the encoded signal illustrating short cross fade regions which are included in both encoded signals;

FIG. 4 a illustrates a block diagram with a switch positioned before the encoding branches;

FIG. 4 b illustrates a block diagram of an encoding scheme with the switch positioned subsequent to encoding the branches;

FIG. 5 a illustrates a wave form of a time domain speech segment as a quasi-periodic or impulse-like signal segment;

FIG. 5 b illustrates a spectrum of the segment of FIG. 5 a;

FIG. 5 c illustrates a time domain speech segment of unvoiced speech as an example for a noise-like segment;

FIG. 5 d illustrates a spectrum of the time domain wave form of FIG. 5 c;

FIG. 6 illustrates a block diagram of an analysis by synthesis CELP encoder;

FIGS. 7 a to 7 d illustrate voiced/unvoiced excitation signals as an example for impulse-like signals;

FIG. 7 e illustrates an encoder-side LPC stage providing short-term prediction information and the prediction error (excitation) signal;

FIG. 7 f illustrates a further embodiment of an LPC device for generating a weighted signal;

FIG. 7 g illustrates an implementation for transforming a weighted signal into an excitation signal by applying an inverse weighting operation and a subsequent excitation analysis as needed in the converter 537 of FIG. 2 b;

FIG. 8 illustrates a block diagram of a joint multi-channel algorithm in accordance with an embodiment of the present invention;

FIG. 9 illustrates an embodiment of a bandwidth extension algorithm;

FIG. 10 a illustrates a detailed description of the switch when performing an open loop decision;

FIG. 10 b illustrates the switch when operating in a closed loop decision mode;

FIG. 11A illustrates a block diagram of an audio encoder in accordance with another aspect of the present invention;

FIG. 11B illustrates a block diagram of another embodiment of an inventive audio decoder;

FIG. 12A illustrates another embodiment of an inventive encoder;

FIG. 12B illustrates another embodiment of an inventive decoder;

FIG. 13A illustrates the interrelation between resolution and window/transform lengths;

FIG. 13B illustrates an overview of a set of transform windows for the first coding branch and a transition from the first to the second coding branch;

FIG. 13C illustrates a plurality of different window sequences including window sequences for the first coding branch and sequences for a transition to the second branch;

FIG. 14A illustrates the framing of an embodiment of the second coding branch;

FIG. 14B illustrates short windows as applied in the second coding branch;

FIG. 14C illustrates medium sized windows applied in the second coding branch;

FIG. 14D illustrates long windows applied by the second coding branch;

FIG. 14E illustrates an exemplary sequence of ACELP frames and TCX frames within a super frame division;

FIG. 14F illustrates different transform lengths corresponding to different time/frequency resolutions for the second encoding branch; and

FIG. 14G illustrates a construction of a window using the definitions of FIG. 14F.

DETAILED DESCRIPTION OF THE INVENTION

FIG. 11A illustrates an embodiment of an audio encoder for encoding an audio signal. The encoder comprises a first coding branch 400 for encoding an audio signal using a first coding algorithm to obtain a first encoded signal.

The audio encoder furthermore comprises a second coding branch 500 for encoding an audio signal using a second coding algorithm to obtain a second encoded signal. The first coding algorithm is different from the second coding algorithm. Additionally, a first switch 200 for switching between the first coding branch and the second coding branch is provided so that, for a portion of the audio signal, either the first encoded signal or the second encoded signal is in an encoder output signal 801.

The audio encoder illustrated in FIG. 11A additionally comprises a signal analyzer 300/525, which is configured for analyzing a portion of the audio signal to determine whether the portion of the audio signal is represented as the first encoded signal or the second encoded signal in the encoder output signal 801.

The signal analyzer 300/525 is furthermore configured for variably determining a respective time/frequency resolution of a first converter 410 in the first coding branch 400 or a second converter 523 in the second encoding branch 500. This time/frequency resolution is applied when the first encoded signal or the second encoded signal representing the portion of the audio signal is generated.

The audio encoder additionally comprises an output interface 800 for generating the encoder output signal 801 comprising an encoded representation of the portion of the audio signal and an information indicating whether the representation of the audio signal is the first encoded signal or the second encoded signal, and indicating the time/frequency resolution used for decoding the first encoded signal and the second encoded signal.

The second encoding branch is different from the first encoding branch in that the second encoding branch additionally comprises a domain converter for converting the audio signal from the domain, in which the audio signal is processed in the first encoding branch, into a different domain. Advantageously the domain converter is an LPC processor 510, but the domain converter can be implemented in any other way as long as the domain converter is different from the first converter 410 and the second converter 523.

The first converter 410 is a time/frequency converter advantageously comprising a windower 410 a and a transformer 410 b. The windower 410 a applies an analysis window to the input audio signal, and the transformer 410 b performs a conversion of the windowed signal into a spectral representation.

Analogously, the second converter 523 advantageously comprises a windower 523 a and a subsequently connected transformer 523 b. The windower 523 a receives the signal output by the domain converter 510 and outputs the windowed representation thereof. The result of one analysis window applied by the windower 523 a is input into the transformer 523 b to form a spectral representation. The transformer can be an FFT or advantageously MDCT processor implementing a corresponding algorithm in software or hardware or in a mixed hardware/software implementation. Alternatively, the transformer can be a filterbank implementation such as a QMF filterbank which can be based on a real-valued or complex modulation of a prototype filter. For specific filterbank implementations, a window is applied. However, for other filterbank implementations, a windowing as needed for a transform algorithm based on an FFT or MDCT is not necessary. When a filterbank implementation is used, the filterbank is a variable resolution filterbank, and the resolution control sets either the frequency resolution together with the time resolution, or only the frequency resolution and not the time resolution. When, however, the converter is implemented as an FFT or MDCT or any other corresponding transformer, then the frequency resolution is connected to the time resolution in that an increase of the frequency resolution obtained by a larger block length in time automatically corresponds to a lower time resolution and vice versa.
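The coupling between frequency resolution and time resolution for a block transform can be illustrated numerically. The sketch treats the block length as the window length in samples and reports the block duration and the spacing of the spectral bins; the sampling rate of 48 kHz and the block lengths 2048 and 256 are illustrative assumptions, and the function name is hypothetical.

```python
def transform_resolution(block_length, sample_rate):
    """Return (block duration in seconds, spectral bin spacing in Hz).

    For an FFT of block_length samples, and likewise for an MDCT whose window
    spans block_length samples, the bin spacing is sample_rate / block_length.
    """
    duration_s = block_length / sample_rate
    bin_spacing_hz = sample_rate / block_length
    return duration_s, bin_spacing_hz


# Illustrative values only: a long and a short block at 48 kHz.
long_duration, long_spacing = transform_resolution(2048, 48000)    # spacing 23.4375 Hz
short_duration, short_spacing = transform_resolution(256, 48000)   # spacing 187.5 Hz
```

The product of block duration and bin spacing stays constant, so a block that is eight times longer resolves frequency eight times more finely and time eight times more coarsely.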

Additionally, the first coding branch may comprise a quantizer/coder stage 421, and the second encoding branch may also comprise one or more further coding tools 524.

Importantly, the signal analyzer is configured for generating a resolution control signal for the first converter 410 and for the second converter 523. Thus, an independent resolution control in both coding branches is implemented in order to have a coding scheme which, on the one hand, provides a low bitrate, and on the other hand, provides a maximum quality in view of the low bitrate. In order to achieve the low bitrate goal, longer window lengths or longer transform lengths are advantageous, but in situations where these long lengths will result in an artifact due to the low time resolution, shorter window lengths and shorter transform lengths are applied, which results in a lower frequency resolution. Advantageously, the signal analyzer applies a statistical analysis or any other analysis which is suited to the corresponding algorithms in the encoding branches. In one implementation mode, in which the first coding branch is a frequency domain coding branch such as an AAC-based encoder, and in which the second coding branch comprises, as a domain converter, an LPC processor 510, the signal analyzer performs a speech/music discrimination so that the speech portion of the audio signal is fed into the second coding branch by correspondingly controlling the switch 200. A music portion of the audio signal is fed into the first coding branch 400 by correspondingly controlling the switch 200 as indicated by the switch control lines. Alternatively, as will be later discussed with respect to FIG. 1C or FIG. 4B, the switch can also be positioned before the output interface 800.

Furthermore, the signal analyzer can receive the audio signal input into the switch 200, or the audio signal output by the switch 200. Furthermore, the signal analyzer performs an analysis in order to not only feed the audio signal into the corresponding coding branch, but to also determine the appropriate time/frequency resolution of the respective converter in the corresponding coding branch, such as the first converter 410 and the second converter 523, as indicated by the resolution control lines connecting the signal analyzer and the converter.

FIG. 11B illustrates an embodiment of an audio decoder matching the audio encoder in FIG. 11A.

The audio decoder in FIG. 11B is configured for decoding an encoded audio signal such as the encoder output signal 801 output by the output interface 800 in FIG. 11A. The encoded signal comprises a first encoded audio signal encoded in accordance with a first coding algorithm, a second encoded signal encoded in accordance with a second coding algorithm, the second coding algorithm being different from the first coding algorithm, and information indicating whether the first coding algorithm or the second coding algorithm is used for decoding the first encoded signal and the second encoded signal, and a time/frequency resolution information for the first encoded audio signal and the second encoded audio signal.

The audio decoder comprises a first decoding branch 431, 440 for decoding the first encoded signal based on the first coding algorithm. Furthermore, the audio decoder comprises a second decoding branch for decoding the second encoded signal using the second coding algorithm.

The first decoding branch comprises a first controllable converter 440 for converting from a spectral domain into the time domain. The controllable converter is configured for being controlled using the time/frequency resolution information from the first encoded signal to obtain the first decoded signal.

The second decoding branch comprises a second controllable converter for converting from a spectral representation into a time representation, the second controllable converter 534 being configured for being controlled using the time/frequency resolution information 991 for the second encoded signal.

The decoder additionally comprises a controller 990 for controlling the first converter 440 and the second converter 534 in accordance with the time/frequency resolution information 991.
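The role of the controller can be sketched as a simple dispatch on the transmitted side information. The data layout below (a branch flag plus a transform length per portion) is a hypothetical stand-in for the actual bitstream syntax, which this passage does not define, and the two branch callables are placeholders for the real decoding branches.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EncodedPortion:
    branch: int            # 1 = first decoding branch, 2 = second decoding branch
    transform_length: int  # stands in for the time/frequency resolution information
    payload: List[float]   # encoded spectral data of this portion


def decode_portion(portion: EncodedPortion,
                   first_branch: Callable[[List[float], int], List[float]],
                   second_branch: Callable[[List[float], int], List[float]]) -> List[float]:
    """Select the decoding branch and pass the signalled resolution to its converter."""
    if portion.branch == 1:
        return first_branch(portion.payload, portion.transform_length)
    if portion.branch == 2:
        return second_branch(portion.payload, portion.transform_length)
    raise ValueError(f"unknown branch indicator: {portion.branch}")
```

In this sketch the decoder makes no signal-dependent decision of its own; both the branch and the resolution follow from the side information alone.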

Furthermore, the decoder comprises a domain converter for generating a synthesis signal using the second decoded signal in order to cancel the domain conversion applied by the domain converter 510 in the encoder of FIG. 11A.

Advantageously, the domain converter 540 is an LPC synthesis processor, which is controlled using LPC filter information included in the encoded signal, where this LPC filter information has been generated by the LPC processor 510 in FIG. 11A and has been input into the encoder output signal as side information. The audio decoder finally comprises a combiner 600 for combining the first decoded signal output by the first converter 440 and the synthesis signal to obtain a decoded audio signal 609.

In the implementation, the first decoding branch additionally comprises a dequantizer/decoder stage 431 for reversing, or at least partly reversing, the operations performed by the corresponding encoder stage 421. It is clear that quantization itself cannot be reversed, since it is a lossy operation. A dequantizer will, however, reverse a certain non-uniformity in a quantization such as a logarithmic or companding quantization.
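The distinction between undoing the non-uniformity and undoing the quantization itself can be illustrated with a minimal sketch. The power-law compander with exponent 0.75, the step size and the function names are assumptions chosen for illustration only; they are not taken from stages 421 and 431.

```python
import math

def quantize(x, step, power=0.75):
    """Compand the magnitude with a power law, then round to an integer index."""
    mag = (abs(x) / step) ** power      # non-uniform (companding) mapping
    q = int(mag + 0.5)                  # rounding: this is the lossy part
    return -q if x < 0 else q

def dequantize(q, step, power=0.75):
    """Reverse the companding only; the rounding error cannot be recovered."""
    if q == 0:
        return 0.0
    mag = abs(q) ** (1.0 / power)       # expand: inverse of the power law
    return math.copysign(mag * step, q)

if __name__ == "__main__":
    for x in (16.0, 10.0, -3.2):
        q = quantize(x, 1.0)
        print(x, q, dequantize(q, 1.0))
```

In this sketch the expansion in `dequantize` is the only operation that can be inverted; the difference between the input and the reconstructed value is the quantization error that remains in the decoded signal.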

In the second decoding branch, the corresponding stage 533 is applied for undoing certain encoding operations applied by the stage 524. Advantageously, stage 524 comprises a uniform quantization. Therefore, the corresponding stage 533 will not have a specific dequantization stage for undoing a certain uniform quantization.

The first converter 440 as well as the second converter 534 may comprise a corresponding inverse transformer stage 440 a, 534 a, a synthesis window stage 440 b, 534 b, and a subsequently connected overlap/add stage 440 c, 534 c. The overlap/add stages are needed when the converters, and more specifically the transformer stages 440 a, 534 a, apply aliasing-introducing transforms such as a modified discrete cosine transform. Then, the overlap/add operation will perform a time domain aliasing cancellation (TDAC). When, however, the transformers apply a transform that does not introduce aliasing, such as an inverse FFT, then an overlap/add stage 440 c is not required. In such an implementation, a cross fading operation to avoid blocking artifacts may be applied.
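The chain of inverse transform, synthesis window and overlap/add can be sketched as follows. This is a minimal illustration under stated assumptions: a direct (slow) MDCT, a sine window, a block length of 2N with a hop of N samples, and hypothetical function names. It is not the implementation of stages 440 a to 440 c.

```python
import math

def sine_window(n2):
    """Sine window of length 2N; satisfies w[n]^2 + w[n+N]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / n2) for n in range(n2)]

def mdct(block):
    """Forward MDCT: 2N windowed samples -> N coefficients."""
    n2 = len(block)
    n = n2 // 2
    return [sum(block[t] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                for t in range(n2)) for k in range(n)]

def imdct(coeffs):
    """Inverse MDCT: N coefficients -> 2N time-aliased samples."""
    n = len(coeffs)
    return [(2.0 / n) * sum(coeffs[k] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                            for k in range(n)) for t in range(2 * n)]

def tdac_roundtrip(signal, n):
    """Analysis window -> MDCT -> IMDCT -> synthesis window -> overlap/add."""
    win = sine_window(2 * n)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * n + 1, n):
        block = [signal[start + t] * win[t] for t in range(2 * n)]
        rec = imdct(mdct(block))                 # inverse transformer stage
        for t in range(2 * n):
            out[start + t] += rec[t] * win[t]    # synthesis window + overlap/add
    return out

if __name__ == "__main__":
    sig = [math.sin(0.3 * t) + 0.1 * t for t in range(32)]
    rec = tdac_roundtrip(sig, 8)
    print(max(abs(rec[t] - sig[t]) for t in range(8, 24)))
```

Samples covered by two overlapping blocks are reconstructed up to rounding error, because the aliasing terms of neighboring blocks cancel in the overlap/add. The first and last N samples are covered by one block only and remain aliased.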

Analogously, the combiner 600 may be a switched combiner or a cross fading combiner, or when aliasing is used for avoiding blocking artifacts, a transition windowing operation is implemented by the combiner similar to an overlap/add stage within a branch itself.
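A cross fading combiner can be sketched as a weighted sum over a short overlap region. The linear fade, the overlap length and the function name are assumptions for illustration; the description does not prescribe a particular fade shape.

```python
def cross_fade(tail_a, head_b):
    """Blend the end of one branch output into the start of the other.

    tail_a: last samples of the signal being faded out
    head_b: first samples of the signal being faded in (same length)
    """
    if len(tail_a) != len(head_b):
        raise ValueError("overlap regions must have equal length")
    n = len(tail_a)
    out = []
    for i in range(n):
        g = (i + 0.5) / n                 # fade-in gain, rises from ~0 to ~1
        out.append((1.0 - g) * tail_a[i] + g * head_b[i])
    return out

if __name__ == "__main__":
    print(cross_fade([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]))
```

Because the two gains sum to one at every sample, a signal that is identical in both branch outputs passes through the cross fade unchanged.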

FIG. 1 a illustrates an embodiment of the invention having two cascaded switches. A mono signal, a stereo signal or a multi-channel signal is input into the switch 200. The switch 200 is controlled by the decision stage 300. The decision stage receives, as an input, a signal input into block 200. Alternatively, the decision stage 300 may also receive side information which is included in the mono signal, the stereo signal or the multi-channel signal or is at least associated with such a signal, where this information was, for example, generated when originally producing the mono signal, the stereo signal or the multi-channel signal.

The decision stage 300 actuates the switch 200 in order to feed a signal either into the frequency encoding portion 400 illustrated at an upper branch of FIG. 1 a or into the LPC-domain encoding portion 500 illustrated at a lower branch in FIG. 1 a. A key element of the frequency domain encoding branch is the spectral conversion block 410 which is operative to convert a common preprocessing stage output signal (as discussed later on) into a spectral domain. The spectral conversion block may include an MDCT algorithm, a QMF, an FFT algorithm, a Wavelet analysis or a filterbank such as a critically sampled filterbank having a certain number of filterbank channels, where the subband signals in this filterbank may be real valued signals or complex valued signals. The output of the spectral conversion block 410 is encoded using a spectral audio encoder 421, which may include processing blocks as known from the AAC coding scheme.

Generally, the processing in branch 400 is a processing in a perception based model or information sink model. Thus, this branch models the human auditory system receiving sound. Contrary thereto, the processing in branch 500 is to generate a signal in the excitation, residual or LPC domain. Generally, the processing in branch 500 is a processing in a speech model or an information generation model. For speech signals, this model is a model of the human speech/sound generation system generating sound. If, however, a sound from a different source requiring a different sound generation model is to be encoded, then the processing in branch 500 may be different.

In the lower encoding branch 500, a key element is an LPC device 510, which outputs an LPC information which is used for controlling the characteristics of an LPC filter. This LPC information is transmitted to a decoder. The LPC stage 510 output signal is an LPC-domain signal which consists of an excitation signal and/or a weighted signal.

The LPC device generally outputs an LPC domain signal, which can be any signal in the LPC domain such as the excitation signal in FIG. 7 e or a weighted signal in FIG. 7 f or any other signal, which has been generated by applying LPC filter coefficients to an audio signal. Furthermore, an LPC device can also determine these coefficients and can also quantize/encode these coefficients.

The decision in the decision stage can be signal-adaptive so that the decision stage performs a music/speech discrimination and controls the switch 200 in such a way that music signals are input into the upper branch 400, and speech signals are input into the lower branch 500. In one embodiment, the decision stage feeds its decision information into an output bit stream so that a decoder can use this decision information in order to perform the correct decoding operations.

Such a decoder is illustrated in FIG. 1 b. The signal output by the spectral audio encoder 421 is, after transmission, input into a spectral audio decoder 431. The output of the spectral audio decoder 431 is input into a time-domain converter 440. Analogously, the output of the LPC domain encoding branch 500 of FIG. 1 a is received on the decoder side and processed by elements 531, 533, 534, and 532 for obtaining an LPC excitation signal. The LPC excitation signal is input into an LPC synthesis stage 540, which receives, as a further input, the LPC information generated by the corresponding LPC analysis stage 510. The output of the time-domain converter 440 and/or the output of the LPC synthesis stage 540 are input into a switch 600. The switch 600 is controlled via a switch control signal which was, for example, generated by the decision stage 300, or which was externally provided such as by a creator of the original mono signal, stereo signal or multi-channel signal. The output of the switch 600 is a complete mono signal, stereo signal or multi-channel signal.

The input signal into the switch 200 and the decision stage 300 can be a mono signal, a stereo signal, a multi-channel signal or generally an audio signal. Depending on the decision which can be derived from the switch 200 input signal or from any external source such as a producer of the original audio signal underlying the signal input into stage 200, the switch switches between the frequency encoding branch 400 and the LPC encoding branch 500. The frequency encoding branch 400 comprises a spectral conversion stage 410 and a subsequently connected quantizing/coding stage 421. The quantizing/coding stage can include any of the functionalities as known from modern frequency-domain encoders such as the AAC encoder. Furthermore, the quantization operation in the quantizing/coding stage 421 can be controlled via a psychoacoustic module which generates psychoacoustic information such as a psychoacoustic masking threshold over the frequency, where this information is input into the stage 421.

In the LPC encoding branch, the switch output signal is processed via an LPC analysis stage 510 generating LPC side info and an LPC-domain signal. The excitation encoder inventively comprises an additional switch 521 for switching the further processing of the LPC-domain signal between a quantization/coding operation 522 in the LPC-domain and a quantization/coding stage 524, which is processing values in the LPC-spectral domain. To this end, a spectral converter 523 is provided at the input of the quantizing/coding stage 524. The switch 521 is controlled in an open loop fashion or a closed loop fashion depending on specific settings as, for example, described in the AMR-WB+ technical specification.

For the closed loop control mode, the encoder additionally includes an inverse quantizer/coder 531 for the LPC domain signal, an inverse quantizer/coder 533 for the LPC spectral domain signal and an inverse spectral converter 534 for the output of item 533. Both encoded and again decoded signals in the processing branches of the second encoding branch are input into the switch control device 525. In the switch control device 525, these two output signals are compared to each other and/or to a target function, or a target function is calculated which may be based on a comparison of the distortion in both signals, so that the signal having the lower distortion is used for deciding which position the switch 521 should take. Alternatively, in case both branches provide non-constant bit rates, the branch providing the lower bit rate might be selected even when the signal to noise ratio of this branch is lower than the signal to noise ratio of the other branch. Alternatively, the target function could use, as an input, the signal to noise ratio of each signal and a bit rate of each signal and/or additional criteria in order to find the best decision for a specific goal. If, for example, the goal is such that the bit rate should be as low as possible, then the target function would heavily rely on the bit rate of the two signals output by the elements 531, 534. However, when the main goal is to have the best quality for a certain bit rate, then the switch control 525 might, for example, discard each signal which is above the allowed bit rate and, when both signals are below the allowed bit rate, the switch control would select the signal having the better signal to noise ratio, i.e., having the smaller quantization/coding distortions.
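The two goals described for the switch control 525 can be sketched as a small selection function. This is a hedged illustration only: the `Candidate` record, the function name `select_branch`, and the fallback used when no candidate fits the bit budget are assumptions that do not appear in the description.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str        # e.g. "ACELP" (output of 531) or "TCX" (output of 534)
    snr_db: float    # signal to noise ratio of the encoded and again decoded signal
    bits: int        # bits spent on this block

def select_branch(candidates, max_bits=None):
    """Pick the branch for one block.

    max_bits is None: goal "bit rate as low as possible".
    max_bits given:   goal "best quality for a certain bit rate".
    """
    if max_bits is None:
        # Lowest bit count wins; higher SNR only breaks ties.
        return min(candidates, key=lambda c: (c.bits, -c.snr_db))
    allowed = [c for c in candidates if c.bits <= max_bits]
    if not allowed:
        # Assumption: if nothing fits, fall back to the cheapest candidate.
        return min(candidates, key=lambda c: c.bits)
    return max(allowed, key=lambda c: c.snr_db)

if __name__ == "__main__":
    block = [Candidate("ACELP", 14.0, 300), Candidate("TCX", 17.5, 420)]
    print(select_branch(block).name)
    print(select_branch(block, max_bits=512).name)
```

With no bit budget the cheaper branch is chosen even though its signal to noise ratio is lower; with a budget that both candidates satisfy, the branch with the smaller distortion is chosen.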

The decoding scheme in accordance with the present invention is, as stated before, illustrated in FIG. 1 b. For each of the three possible output signal kinds, a specific decoding/re-quantizing stage 431, 531 or 533 exists. While stage 431 outputs a time-spectrum which is converted into the time-domain using the frequency/time converter 440, stage 531 outputs an LPC-domain signal, and item 533 outputs an LPC-spectrum. In order to make sure that the input signals into switch 532 are both in the LPC-domain, the LPC-spectrum/LPC-converter 534 is provided. The output data of the switch 532 is transformed back into the time-domain using an LPC synthesis stage 540, which is controlled via encoder-side generated and transmitted LPC information. Then, subsequent to block 540, both branches have time-domain information which is switched in accordance with a switch control signal in order to finally obtain an audio signal such as a mono signal, a stereo signal or a multi-channel signal, which depends on the signal input into the encoding scheme of FIG. 1 a.

FIG. 1 c illustrates a further embodiment with a different arrangement of the switch 521 similar to the principle of FIG. 4 b.

FIG. 2 a illustrates an encoding scheme in accordance with a second aspect of the invention. A common preprocessing scheme connected to the switch 200 input may comprise a surround/joint stereo block 101 which generates, as an output, joint stereo parameters and a mono output signal, which is generated by downmixing the input signal which is a signal having two or more channels. Generally, the signal at the output of block 101 can also be a signal having more channels, but due to the downmixing functionality of block 101, the number of channels at the output of block 101 will be smaller than the number of channels input into block 101.

The common preprocessing scheme may comprise, alternatively to the block 101 or in addition to the block 101, a bandwidth extension stage 102. In the FIG. 2 a embodiment, the output of block 101 is input into the bandwidth extension block 102 which, in the encoder of FIG. 2 a, outputs a band-limited signal such as the low band signal or the low pass signal at its output. Advantageously, this signal is downsampled (e.g. by a factor of two) as well. Furthermore, for the high band of the signal input into block 102, bandwidth extension parameters such as spectral envelope parameters, inverse filtering parameters, noise floor parameters etc. as known from the HE-AAC profile of MPEG-4 are generated and forwarded to a bitstream multiplexer 800.

Advantageously, the decision stage 300 receives the signal input into block 101 or input into block 102 in order to decide between, for example, a music mode or a speech mode. In the music mode, the upper encoding branch 400 is selected, while, in the speech mode, the lower encoding branch 500 is selected. Advantageously, the decision stage additionally controls the joint stereo block 101 and/or the bandwidth extension block 102 to adapt the functionality of these blocks to the specific signal. Thus, when the decision stage determines that a certain time portion of the input signal is of the first mode such as the music mode, then specific features of block 101 and/or block 102 can be controlled by the decision stage 300. Alternatively, when the decision stage 300 determines that the signal is in a speech mode or, generally, in a second LPC-domain mode, then specific features of blocks 101 and 102 can be controlled in accordance with the decision stage output.

Advantageously, the spectral conversion of the coding branch 400 is done using an MDCT operation which, even more advantageously, is the time-warped MDCT operation, where the warping strength can be controlled between zero and a high warping strength. At zero warping strength, the MDCT operation in block 411 is a straightforward MDCT operation known in the art. The time warping strength together with time warping side information can be transmitted/input into the bitstream multiplexer 800 as side information.

In the LPC encoding branch, the LPC-domain encoder may include an ACELP core 526 calculating a pitch gain, a pitch lag and/or codebook information such as a codebook index and gain. The TCX mode as known from 3GPP TS 26.290 involves processing a perceptually weighted signal in the transform domain. A Fourier transformed weighted signal is quantized using a split multi-rate lattice quantization (algebraic VQ) with noise factor quantization. A transform is calculated in 1024, 512, or 256 sample windows. The excitation signal is recovered by inverse filtering the quantized weighted signal through an inverse weighting filter.

In the first coding branch 400, a spectral converter advantageously comprises a specifically adapted MDCT operation having certain window functions followed by a quantization/entropy encoding stage which may consist of a single vector quantization stage, but advantageously is a combined scalar quantizer/entropy coder similar to the quantizer/coder in the frequency domain coding branch, i.e., in item 421 of FIG. 2 a.

In the second coding branch, there is the LPC block 510 followed by a switch 521, again followed by an ACELP block 526 or a TCX block 527. ACELP is described in 3GPP TS 26.190 and TCX is described in 3GPP TS 26.290. Generally, the ACELP block 526 receives an LPC excitation signal as calculated by a procedure as described in FIG. 7 e. The TCX block 527 receives a weighted signal as generated by FIG. 7 f.

In TCX, the transform is applied to the weighted signal computed by filtering the input signal through an LPC-based weighting filter. The weighting filter used in embodiments of the invention is given by (1−A(z/γ))/(1−μz⁻¹). Thus, the weighted signal is an LPC domain signal and its transform is in the LPC-spectral domain. The signal processed by ACELP block 526 is the excitation signal and is different from the signal processed by the block 527, but both signals are in the LPC domain.

At the decoder side illustrated in FIG. 2 b, after the inverse spectral transform in block 537, the inverse of the weighting filter is applied, that is (1−μz⁻¹)/(1−A(z/γ)). Then, the signal is filtered through (1−A(z)) to go to the LPC excitation domain. Thus, the conversion to LPC domain block 534 and the TCX⁻¹ block 537 include inverse transform and then filtering through

$\frac{1 - \mu z^{-1}}{1 - A(z/\gamma)}\,\left( 1 - A(z) \right)$ to convert from the weighted domain to the excitation domain.
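The conversion from the weighted domain to the excitation domain can be sketched with plain direct-form filters. The sketch assumes A(z) = a₁z⁻¹ + a₂z⁻² + … so that (1−A(z)) is the analysis filter, zero initial filter states, and illustrative values for the coefficients, γ and μ. All function names are hypothetical.

```python
def fir(x, b):
    """y[n] = sum_k b[k] * x[n-k], zero initial state."""
    return [sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
            for n in range(len(x))]

def all_pole(x, den):
    """Filter through 1/den(z) with den[0] == 1, zero initial state."""
    y = []
    for n in range(len(x)):
        acc = x[n] - sum(den[k] * y[n - k] for k in range(1, len(den)) if n - k >= 0)
        y.append(acc)
    return y

def analysis_poly(a, gamma=1.0):
    """Coefficients of 1 - A(z/gamma) for A(z) = sum_k a[k-1] z^-k."""
    return [1.0] + [-ak * gamma ** (k + 1) for k, ak in enumerate(a)]

def to_weighted(x, a, gamma, mu):
    """Encoder side: (1 - A(z/gamma)) / (1 - mu z^-1)."""
    return all_pole(fir(x, analysis_poly(a, gamma)), [1.0, -mu])

def weighted_to_excitation(w, a, gamma, mu):
    """Decoder side: (1 - mu z^-1) / (1 - A(z/gamma)), then (1 - A(z))."""
    x_hat = all_pole(fir(w, [1.0, -mu]), analysis_poly(a, gamma))
    return fir(x_hat, analysis_poly(a))

if __name__ == "__main__":
    a, gamma, mu = [1.2, -0.5], 0.92, 0.68      # illustrative values only
    x = [((7 * n) % 11) / 11.0 - 0.5 for n in range(40)]
    direct = fir(x, analysis_poly(a))            # excitation computed directly
    via_w = weighted_to_excitation(to_weighted(x, a, gamma, mu), a, gamma, mu)
    print(max(abs(d - v) for d, v in zip(direct, via_w)))
```

Without quantization in between, the excitation obtained from the weighted signal equals the excitation obtained by filtering the input directly through (1−A(z)), up to rounding error.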

Although item 510 in FIGS. 1 a, 1 c, 2 a, 2 c illustrates a single block, block 510 can output different signals as long as these signals are in the LPC domain. The actual mode of block 510 such as the excitation signal mode or the weighted signal mode can depend on the actual switch state. Alternatively, the block 510 can have two parallel processing devices, where one device is implemented similar to FIG. 7 e and the other device is implemented as FIG. 7 f. Hence, the LPC domain at the output of 510 can represent either the LPC excitation signal or the LPC weighted signal or any other LPC domain signal.

In the second encoding branch (ACELP/TCX) of FIG. 2 a or 2 c, the signal is advantageously pre-emphasized through a filter 1−0.68z⁻¹ before encoding. At the ACELP/TCX decoder in FIG. 2 b the synthesized signal is deemphasized with the filter 1/(1−0.68z⁻¹). The preemphasis can be part of the LPC block 510 where the signal is preemphasized before LPC analysis and quantization. Similarly, deemphasis can be part of the LPC synthesis block LPC⁻¹ 540.
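The pre-emphasis and deemphasis filters can be written as two one-line difference equations. The sketch below assumes zero initial state and uses hypothetical function names; the coefficient 0.68 is the one given above.

```python
def preemphasize(x, mu=0.68):
    """Apply 1 - mu*z^-1:  y[n] = x[n] - mu * x[n-1]."""
    return [x[n] - (mu * x[n - 1] if n > 0 else 0.0) for n in range(len(x))]

def deemphasize(y, mu=0.68):
    """Apply 1 / (1 - mu*z^-1):  x[n] = y[n] + mu * x[n-1]."""
    x = []
    prev = 0.0
    for sample in y:
        prev = sample + mu * prev
        x.append(prev)
    return x

if __name__ == "__main__":
    sig = [0.0, 1.0, 0.5, -0.25, 0.75]
    print(deemphasize(preemphasize(sig)))
```

Applied back to back without any coding in between, the two filters cancel, so the deemphasized signal equals the original signal up to rounding error.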

FIG. 2 c illustrates a further embodiment for the implementation of FIG. 2 a, but with a different arrangement of the switch 521 similar to the principle of FIG. 4 b.

In an embodiment, the first switch 200 (see FIG. 1 a or 2 a) is controlled through an open-loop decision (as in FIG. 4 a) and the second switch is controlled through a closed-loop decision (as in FIG. 4 b).

For example, FIG. 2 c has the second switch placed after the ACELP and TCX branches as in FIG. 4 b. Then, in the first processing branch, the first LPC domain represents the LPC excitation, and in the second processing branch, the second LPC domain represents the LPC weighted signal. That is, the first LPC domain signal is obtained by filtering through (1−A(z)) to convert to the LPC residual domain, while the second LPC domain signal is obtained by filtering through the filter (1−A(z/γ))/(1−μz⁻¹) to convert to the LPC weighted domain.

FIG. 2 b illustrates a decoding scheme corresponding to the encoding scheme of FIG. 2 a. The bitstream generated by bitstream multiplexer 800 of FIG. 2 a is input into a bitstream demultiplexer 900. Depending on information derived, for example, from the bitstream via a mode detection block 601, a decoder-side switch 600 is controlled to either forward signals from the upper branch or signals from the lower branch to the bandwidth extension block 701. The bandwidth extension block 701 receives, from the bitstream demultiplexer 900, side information and, based on this side information and the output of the mode detection block 601, reconstructs the high band based on the low band output by switch 600.

The full band signal generated by block 701 is input into the joint stereo/surround processing stage 702, which reconstructs two stereo channels or several multi-channels. Generally, block 702 will output more channels than were input into this block. Depending on the application, the input into block 702 may even include two channels such as in a stereo mode and may even include more channels as long as the output by this block has more channels than the input into this block.

The switch 200 has been shown to switch between both branches so that only one branch receives a signal to process and the other branch does not receive a signal to process. In an alternative embodiment, however, the switch may also be arranged subsequent to, for example, the audio encoder 421 and the excitation encoder 522, 523, 524, which means that both branches 400, 500 process the same signal in parallel. In order to not double the bitrate, however, only the signal output by one of those encoding branches 400 or 500 is selected to be written into the output bitstream. The decision stage will then operate so that the signal written into the bitstream minimizes a certain cost function, where the cost function can be the generated bitrate or the generated perceptual distortion or a combined rate/distortion cost function. Therefore, either in this mode or in the mode illustrated in the Figures, the decision stage can also operate in a closed loop mode in order to make sure that, finally, only the encoding branch output is written into the bitstream which has, for a given perceptual distortion, the lowest bitrate or, for a given bitrate, the lowest perceptual distortion. In the closed loop mode, the feedback input may be derived from outputs of the three quantizer/scaler blocks 421, 522 and 524 in FIG. 1 a.

In the implementation having two switches, i.e., the first switch 200 and the second switch 521, it is advantageous that the time resolution for the first switch is lower than the time resolution for the second switch. Stated differently, the blocks of the input signal into the first switch, which can be switched via a switch operation, are larger than the blocks switched by the second switch operating in the LPC-domain. Exemplarily, the frequency domain/LPC-domain switch 200 may switch blocks of a length of 1024 samples, and the second switch 521 can switch blocks having 256 samples each.
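The two switching granularities can be illustrated with a small layout function for the exemplary block lengths of 1024 and 256 samples. The mode labels "FD", "LPC", "ACELP" and "TCX" and the function name are assumptions for illustration; they do not describe any bitstream syntax.

```python
def frame_layout(first_switch, second_switch=(), frame_len=1024, sub_len=256):
    """Return (start, length, mode) tuples for one frame.

    first_switch:  "FD" or "LPC", decided once per frame_len samples
    second_switch: per-sub-block modes, decided once per sub_len samples
                   (only used when first_switch == "LPC")
    """
    if first_switch == "FD":
        return [(0, frame_len, "FD")]
    n_sub = frame_len // sub_len
    if len(second_switch) != n_sub:
        raise ValueError("expected %d sub-block decisions" % n_sub)
    return [(i * sub_len, sub_len, mode) for i, mode in enumerate(second_switch)]

if __name__ == "__main__":
    print(frame_layout("FD"))
    print(frame_layout("LPC", ("ACELP", "ACELP", "TCX", "ACELP")))
```

With these lengths, one decision of the first switch spans four possible decisions of the second switch.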

Although some of the FIGS. 1 a through 10 b are illustrated as block diagrams of an apparatus, these figures simultaneously are an illustration of a method, where the block functionalities correspond to the method steps.

FIG. 3 a illustrates an audio encoder for generating an encoded audio signal as an output of the first encoding branch 400 and a second encoding branch 500. Furthermore, the encoded audio signal includes side information such as pre-processing parameters from the common pre-processing stage or, as discussed in connection with preceding Figs., switch control information.

Advantageously, the first encoding branch is operative in order to encode an audio intermediate signal 195 in accordance with a first coding algorithm, wherein the first coding algorithm has an information sink model. The first encoding branch 400 generates the first encoder output signal which is an encoded spectral information representation of the audio intermediate signal 195.

Furthermore, the second encoding branch 500 is adapted for encoding the audio intermediate signal 195 in accordance with a second encoding algorithm, the second coding algorithm having an information source model and generating, in a second encoder output signal, encoded parameters for the information source model representing the intermediate audio signal.

The audio encoder furthermore comprises the common pre-processing stage for pre-processing an audio input signal 99 to obtain the audio intermediate signal 195. Specifically, the common pre-processing stage is operative to process the audio input signal 99 so that the audio intermediate signal 195, i.e., the output of the common pre-processing stage, is a compressed version of the audio input signal.

A method of audio encoding for generating an encoded audio signal comprises a step of encoding 400 an audio intermediate signal 195 in accordance with a first coding algorithm, the first coding algorithm having an information sink model and generating, in a first output signal, encoded spectral information representing the audio signal; a step of encoding 500 an audio intermediate signal 195 in accordance with a second coding algorithm, the second coding algorithm having an information source model and generating, in a second output signal, encoded parameters for the information source model representing the intermediate signal 195; and a step of commonly pre-processing 100 an audio input signal 99 to obtain the audio intermediate signal 195, wherein, in the step of commonly pre-processing, the audio input signal 99 is processed so that the audio intermediate signal 195 is a compressed version of the audio input signal 99, and wherein the encoded audio signal includes, for a certain portion of the audio signal, either the first output signal or the second output signal. The method includes the further step of encoding a certain portion of the audio intermediate signal either using the first coding algorithm or using the second coding algorithm, or encoding the signal using both algorithms and outputting in an encoded signal either the result of the first coding algorithm or the result of the second coding algorithm.

Generally, the audio encoding algorithm used in the first encoding branch 400 reflects and models the situation in an audio sink. The sink of an audio information is normally the human ear. The human ear can be modeled as a frequency analyzer. Therefore, the first encoding branch outputs encoded spectral information. Advantageously, the first encoding branch furthermore includes a psychoacoustic model for additionally applying a psychoacoustic masking threshold. This psychoacoustic masking threshold is used when quantizing audio spectral values where, advantageously, the quantization is performed such that the quantization noise introduced by quantizing the spectral audio values is hidden below the psychoacoustic masking threshold.
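How a masking threshold can steer the quantizer may be sketched with a uniform quantizer whose step size is derived from the allowed noise power. The sketch relies on the common approximation that a uniform quantizer with step Δ produces noise power of about Δ²/12; this model, the function names and the per-band treatment are assumptions and are not taken from the description.

```python
import math

def step_for_threshold(mask_power):
    """Largest uniform step whose noise power (step^2 / 12) stays at the threshold."""
    return math.sqrt(12.0 * mask_power)

def quantize_uniform(x, step):
    """Uniform quantization followed by reconstruction."""
    return round(x / step) * step

if __name__ == "__main__":
    mask_power = 0.02                       # allowed noise power in one band
    step = step_for_threshold(mask_power)
    band = [0.001 * i for i in range(10000)]
    noise = sum((v - quantize_uniform(v, step)) ** 2 for v in band) / len(band)
    print(step, noise)
```

A higher masking threshold in a band permits a coarser step and therefore fewer bits for that band, while the resulting noise stays near the threshold.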

The second encoding branch represents an information source model, which reflects the generation of audio sound. Therefore, information source models may include a speech model which is reflected by an LPC analysis stage, i.e., by transforming a time domain signal into an LPC domain and by subsequently processing the LPC residual signal, i.e., the excitation signal. Alternative sound source models, however, are sound source models for representing a certain instrument or any other sound generators such as a specific sound source existing in the real world. A selection between different sound source models can be performed when several sound source models are available, for example based on an SNR calculation, i.e., based on a calculation of which of the source models is best suited for encoding a certain time portion and/or frequency portion of an audio signal. Advantageously, however, the switch between encoding branches is performed in the time domain, i.e., a certain time portion is encoded using one model and a certain different time portion of the intermediate signal is encoded using the other encoding branch.

Information source models are represented by certain parameters. Regarding the speech model, the parameters are LPC parameters and coded excitation parameters, when a modern speech coder such as AMR-WB+ is considered. The AMR-WB+ comprises an ACELP encoder and a TCX encoder. In this case, the coded excitation parameters can be global gain, noise floor, and variable length codes.

FIG. 3 b illustrates a decoder corresponding to the encoder illustrated in FIG. 3 a. Generally, FIG. 3 b illustrates an audio decoder for decoding an encoded audio signal to obtain a decoded audio signal 799. The decoder includes the first decoding branch 450 for decoding an encoded signal encoded in accordance with a first coding algorithm having an information sink model. The audio decoder furthermore includes a second decoding branch 550 for decoding an encoded information signal encoded in accordance with a second coding algorithm having an information source model. The audio decoder furthermore includes a combiner for combining output signals from the first decoding branch 450 and the second decoding branch 550 to obtain a combined signal. The combined signal, which is illustrated in FIG. 3 b as the decoded audio intermediate signal 699, is input into a common post processing stage for post processing the decoded audio intermediate signal 699, which is the combined signal output by the combiner 600, so that an output signal of the common post processing stage is an expanded version of the combined signal. Thus, the decoded audio signal 799 has an enhanced information content compared to the decoded audio intermediate signal 699. This information expansion is provided by the common post processing stage with the help of pre/post processing parameters which can be transmitted from an encoder to a decoder, or which can be derived from the decoded audio intermediate signal itself. Advantageously, however, pre/post processing parameters are transmitted from an encoder to a decoder, since this procedure allows an improved quality of the decoded audio signal.

FIG. 3 c illustrates an audio encoder for encoding an audio input signal 195, which may be equal to the intermediate audio signal 195 of FIG. 3 a in accordance with the embodiment of the present invention. The audio input signal 195 is present in a first domain which can, for example, be the time domain but which can also be any other domain such as a frequency domain, an LPC domain, an LPC spectral domain or any other domain. Generally, the conversion from one domain to the other domain is performed by a conversion algorithm such as any of the well-known time/frequency conversion algorithms or frequency/time conversion algorithms.

An alternative transform from the time domain, for example into the LPC domain, is the result of LPC filtering a time domain signal, which results in an LPC residual signal or excitation signal. Any other filtering operation producing a filtered signal which has an impact on a substantial number of signal samples before the transform can be used as a transform algorithm as the case may be. Therefore, weighting an audio signal using an LPC based weighting filter is a further transform, which generates a signal in the LPC domain. In a time/frequency transform, the modification of a single spectral value will have an impact on all time domain values before the transform. Analogously, a modification of any time domain sample will have an impact on each frequency domain sample. Similarly, a modification of a sample of the excitation signal in an LPC domain situation will have, due to the length of the LPC filter, an impact on a substantial number of samples before the LPC filtering. Similarly, a modification of a sample before an LPC transformation will have an impact on many samples obtained by this LPC transformation due to the inherent memory effect of the LPC filter.

The audio encoder of FIG. 3 c includes a first coding branch 400 which generates a first encoded signal. This first encoded signal may be in a fourth domain which is, in the embodiment, the time-spectral domain, i.e., the domain which is obtained when a time domain signal is processed via a time/frequency conversion.

Therefore, the first coding branch 400 for encoding an audio signal uses a first coding algorithm to obtain a first encoded signal, where this first coding algorithm may or may not include a time/frequency conversion algorithm.

The audio encoder furthermore includes a second coding branch 500 for encoding an audio signal. The second coding branch 500 uses a second coding algorithm, which is different from the first coding algorithm, to obtain a second encoded signal.

The audio encoder furthermore includes a first switch 200 for switching between the first coding branch 400 and the second coding branch 500 so that for a portion of the audio input signal, either the first encoded signal at the output of block 400 or the second encoded signal at the output of the second encoding branch is included in an encoder output signal. Thus, when for a certain portion of the audio input signal 195, the first encoded signal in the fourth domain is included in the encoder output signal, the second encoded signal which is either the first processed signal in the second domain or the second processed signal in the third domain is not included in the encoder output signal. This makes sure that this encoder is bit rate efficient. In embodiments, any time portions of the audio signal which are included in two different encoded signals are small compared to a frame length of a frame as will be discussed in connection with FIG. 3 e. These small portions are useful for a cross fade from one encoded signal to the other encoded signal in the case of a switch event in order to reduce artifacts that might occur without any cross fade. Therefore, apart from the cross-fade region, each time domain block is represented by an encoded signal of only a single domain.

As illustrated in FIG. 3 c, the second coding branch 500 comprises a converter 510 for converting the audio signal in the first domain, i.e., signal 195, into a second domain.

Furthermore, the second coding branch 500 comprises a first processing branch 522 for processing an audio signal in the second domain to obtain a first processed signal which is, advantageously, also in the second domain so that the first processing branch 522 does not perform a domain change.

The second encoding branch 500 furthermore comprises a second processing branch 523, 524 which converts the audio signal in the second domain into a third domain, which is different from the first domain and which is also different from the second domain, and which processes the audio signal in the third domain to obtain a second processed signal at the output of the second processing branch 523, 524.

Furthermore, the second coding branch comprises a second switch 521 for switching between the first processing branch 522 and the second processing branch 523, 524 so that, for a portion of the audio signal input into the second coding branch, either the first processed signal in the second domain or the second processed signal in the third domain is in the second encoded signal.

FIG. 3 d illustrates a corresponding decoder for decoding an encoded audio signal generated by the encoder of FIG. 3 c. Generally, each block of the first domain audio signal is represented by either a second domain signal, a third domain signal or a fourth domain encoded signal apart from an optional cross fade region which is, advantageously, short compared to the length of one frame in order to obtain a system which is as much as possible at the critical sampling limit. The encoded audio signal includes the first coded signal, a second coded signal in a second domain and a third coded signal in a third domain, wherein the first coded signal, the second coded signal and the third coded signal all relate to different time portions of the decoded audio signal and wherein the second domain, the third domain and the first domain for a decoded audio signal are different from each other.

The decoder comprises a first decoding branch for decoding based on the first coding algorithm. The first decoding branch is illustrated at 431, 440 in FIG. 3 d and advantageously comprises a frequency/time converter. The first coded signal is advantageously in a fourth domain and is converted into the first domain which is the domain for the decoded output signal.

The decoder of FIG. 3 d furthermore comprises a second decoding branch which comprises several elements. These elements are a first inverse processing branch 531 for inverse processing the second coded signal to obtain a first inverse processed signal in the second domain at the output of block 531. The second decoding branch furthermore comprises a second inverse processing branch 533, 534 for inverse processing a third coded signal to obtain a second inverse processed signal in the second domain, where the second inverse processing branch comprises a converter for converting from the third domain into the second domain.

The second decoding branch furthermore comprises a first combiner 532 for combining the first inverse processed signal and the second inverse processed signal to obtain a signal in the second domain, where this combined signal is, at the first time instant, only influenced by the first inverse processed signal and is, at a later time instant, only influenced by the second inverse processed signal.

The second decoding branch furthermore comprises a converter 540 for converting the combined signal to the first domain.

Finally, the decoder illustrated in FIG. 3 d comprises a second combiner 600 for combining the decoded first signal from block 431, 440 and the converter 540 output signal to obtain a decoded output signal in the first domain. Again, the decoded output signal in the first domain is, at the first time instant, only influenced by the signal output by the converter 540 and is, at a later time instant, only influenced by the first decoded signal output by block 431, 440.

This situation is illustrated, from an encoder perspective, in FIG. 3 e. The upper portion in FIG. 3 e illustrates, in a schematic representation, a first domain audio signal such as a time domain audio signal, where the time index increases from left to right and item 3 might be considered as a stream of audio samples representing the signal 195 in FIG. 3 c. FIG. 3 e illustrates frames 3 a, 3 b, 3 c, 3 d which may be generated by switching between the first encoded signal and the first processed signal and the second processed signal as illustrated at item 4 in FIG. 3 e. The first encoded signal, the first processed signal and the second processed signal are all in different domains and, in order to make sure that the switch between the different domains does not result in an artifact on the decoder-side, frames 3 a, 3 b of the time domain signal have an overlapping range which is indicated as a cross fade region, and such a cross fade region is also there at frames 3 b and 3 c. However, no such cross fade region exists between frames 3 c and 3 d, which means that frame 3 d is also represented by a second processed signal, i.e., a signal in the third domain, and there is no domain change between frames 3 c and 3 d. Therefore, generally, it is advantageous not to provide a cross fade region where there is no domain change and to provide a cross fade region, i.e., a portion of the audio signal which is encoded by two subsequent coded/processed signals, when there is a domain change, i.e., a switching action of either of the two switches. Advantageously, crossfades are performed for other domain changes.
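The cross fade at a domain change can be pictured with a minimal sketch. The function name `cross_fade`, the linear ramp and the list-based signal representation are assumptions made for illustration only; the text does not prescribe a particular fade shape.

```python
def cross_fade(outgoing, incoming, overlap):
    """Join two decoded blocks whose last/first `overlap` samples cover
    the same time portion, blending them with a linear ramp (assumed shape)."""
    if overlap > min(len(outgoing), len(incoming)):
        raise ValueError("overlap is longer than one of the blocks")
    split = len(outgoing) - overlap
    blended = []
    for i in range(overlap):
        w_in = (i + 0.5) / overlap      # weight of the incoming block, rising towards 1
        w_out = 1.0 - w_in              # weight of the outgoing block, falling towards 0
        blended.append(w_out * outgoing[split + i] + w_in * incoming[i])
    # Outside the overlap, each sample comes from a single block only.
    return outgoing[:split] + blended + incoming[overlap:]
```

With `overlap` set to 0 the two blocks are simply concatenated, which corresponds to the case without a domain change, where no cross fade region is provided.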

In the embodiment, in which the first encoded signal or the second processed signal has been generated by an MDCT processing having e.g. 50 percent overlap, each time domain sample is included in two subsequent frames. Due to the characteristics of the MDCT, however, this does not result in an overhead, since the MDCT is a critically sampled system. In this context, critically sampled means that the number of spectral values is the same as the number of time domain values. The MDCT is advantageous in that the crossover effect is provided without a specific crossover region so that a crossover from an MDCT block to the next MDCT block is provided without any overhead which would violate the critical sampling requirement.
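The critical sampling property can be checked numerically with a small sketch. It assumes a sine window and the textbook MDCT/IMDCT definitions with a 2/N synthesis scaling; the function names and the frame size are illustrative choices, not values taken from the embodiment.

```python
import math

def sine_window(length):
    # Symmetric window with w[n]^2 + w[n + length/2]^2 == 1 (Princen-Bradley condition).
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]

def mdct(frame):
    # 2*half windowed time samples in, only `half` spectral values out.
    half = len(frame) // 2
    return [sum(frame[n] * math.cos(math.pi / half * (n + 0.5 + half / 2) * (k + 0.5))
                for n in range(2 * half))
            for k in range(half)]

def imdct(coeffs):
    # `half` spectral values in, 2*half time-aliased samples out.
    half = len(coeffs)
    return [2.0 / half * sum(coeffs[k] * math.cos(math.pi / half * (n + 0.5 + half / 2) * (k + 0.5))
                             for k in range(half))
            for n in range(2 * half)]

def overlap_add_roundtrip(signal, half):
    """Analyse with 50 percent overlap, resynthesise, and overlap-add."""
    window = sine_window(2 * half)
    out = [0.0] * len(signal)
    n_coeffs = 0
    for start in range(0, len(signal) - 2 * half + 1, half):
        frame = [window[n] * signal[start + n] for n in range(2 * half)]
        coeffs = mdct(frame)
        n_coeffs += len(coeffs)
        rebuilt = imdct(coeffs)
        for n in range(2 * half):
            out[start + n] += window[n] * rebuilt[n]   # aliasing cancels in the overlap
    return out, n_coeffs
```

Each hop of `half` new samples produces exactly `half` spectral values, and the samples covered by two frames are reconstructed by the overlap-add. The first and last `half` samples of the test signal are covered by one frame only and are therefore not reconstructed in this sketch.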

Advantageously, the first coding algorithm in the first coding branch is based on an information sink model, and the second coding algorithm in the second coding branch is based on an information source or an SNR model. An SNR model is a model which is not specifically related to a specific sound generation mechanism but which is one coding mode which can be selected among a plurality of coding modes based e.g. on a closed loop decision. Thus, an SNR model is any available coding model which does not necessarily have to be related to the physical constitution of the sound generator but which is any parameterized coding model different from the information sink model, which can be selected by a closed loop decision and, specifically, by comparing different SNR results from different models.
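A closed loop decision that compares SNR results can be sketched as follows. The names `snr_db` and `select_mode`, and the representation of each coding mode as an encode-then-decode callable, are hypothetical; a practical decision would typically also take the bit demand of each mode into account, which this sketch ignores.

```python
import math

def snr_db(original, decoded):
    """Signal-to-noise ratio in dB between a block and its decoded version."""
    signal = sum(x * x for x in original)
    noise = sum((x - y) ** 2 for x, y in zip(original, decoded))
    if noise == 0.0:
        return math.inf
    return 10.0 * math.log10(signal / noise)

def select_mode(block, roundtrips):
    """Run every candidate mode (encode followed by decode) on the block
    and return the name of the mode with the highest SNR."""
    return max(roundtrips, key=lambda name: snr_db(block, roundtrips[name](block)))

# Two toy "coding modes": uniform quantizers with different step sizes.
def make_quantizer(step):
    return lambda block: [round(x / step) * step for x in block]

modes = {"coarse": make_quantizer(0.5), "fine": make_quantizer(0.1)}
chosen = select_mode([0.1, 0.4, -0.3, 0.8], modes)
```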

As illustrated in FIG. 3 c, a controller 300, 525 is provided. This controller may include the functionalities of the decision stage 300 of FIG. 1 a and, additionally, may include the functionality of the switch control device 525 in FIG. 1 a. Generally, the controller is for controlling the first switch and the second switch in a signal adaptive way. The controller is operative to analyze a signal input into the first switch or output by the first or the second coding branch or signals obtained by encoding and decoding from the first and the second encoding branch with respect to a target function. Alternatively, or additionally, the controller is operative to analyze the signal input into the second switch or output by the first processing branch or the second processing branch or obtained by processing and inverse processing from the first processing branch and the second processing branch, again with respect to a target function.

In one embodiment, the first coding branch or the second coding branch comprises an aliasing introducing time/frequency conversion algorithm such as an MDCT or an MDST algorithm, which is different from a straightforward FFT transform, which does not introduce an aliasing effect. Furthermore, one or both branches comprise a quantizer/entropy coder block. Specifically, only the second processing branch of the second coding branch includes the time/frequency converter introducing an aliasing operation and the first processing branch of the second coding branch comprises a quantizer and/or entropy coder and does not introduce any aliasing effects. The aliasing introducing time/frequency converter advantageously comprises a windower for applying an analysis window and an MDCT transform algorithm. Specifically, the windower is operative to apply the window function to subsequent frames in an overlapping way so that a sample of a windowed signal occurs in at least two subsequent windowed frames.

In one embodiment, the first processing branch comprises an ACELP coder and the second processing branch comprises an MDCT spectral converter and the quantizer for quantizing spectral components to obtain quantized spectral components, where each quantized spectral component is zero or is defined by one quantizer index of the plurality of different possible quantizer indices.

Furthermore, it is advantageous that the first switch 200 operates in an open loop manner and the second switch operates in a closed loop manner.

As stated before, both coding branches are operative to encode the audio signal in a block-wise manner, in which the first switch or the second switch switches in a block-wise manner so that a switching action takes place, at the minimum, after a block of a predefined number of samples of a signal, the predefined number forming a frame length for the corresponding switch. Thus, the granule for switching by the first switch may be, for example, a block of 2048 or 1024 samples, and the frame length, based on which the first switch 200 is switching, may be variable but is, advantageously, fixed to such a quite long period.

Contrary thereto, the block length for the second switch 521, i.e., when the second switch 521 switches from one mode to the other, is substantially smaller than the block length for the first switch. Advantageously, both block lengths for the switches are selected such that the longer block length is an integer multiple of the shorter block length. In the embodiment, the block length of the first switch is 2048 or 1024 and the block length of the second switch is 1024 or, more advantageously, 512 and, even more advantageously, 256 and, even more advantageously, 128 samples so that, at the maximum, the second switch can switch 16 times when the first switch switches only a single time. An advantageous maximum block length ratio, however, is 4:1.
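The relation between the two block lengths is simple arithmetic, sketched below. The helper name `switches_per_first_block` is illustrative, and the integer-multiple check reflects the advantageous selection stated above rather than a hard requirement.

```python
def switches_per_first_block(first_len, second_len):
    """Number of second-switch blocks that fit into one first-switch block."""
    if second_len <= 0 or first_len % second_len != 0:
        raise ValueError("longer block length should be an integer multiple of the shorter one")
    return first_len // second_len

# Block lengths named in the embodiment.
ratios = {(first, second): switches_per_first_block(first, second)
          for first in (2048, 1024)
          for second in (1024, 512, 256, 128)}
```

For example, a first block of 2048 samples and a second block of 128 samples give the ratio of 16, while 1024 and 256 samples give a ratio of 4.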

In a further embodiment, the controller 300, 525 is operative to perform a speech music discrimination for the first switch in such a way that a decision to speech is favored with respect to a decision to music. In this embodiment, a decision to speech is taken even when a portion of less than 50% of a frame for the first switch is speech and the portion of more than 50% of the frame is music.

Furthermore, the controller is operative to already switch to the speech mode when a quite small portion of the first frame is speech and, specifically, when a portion of the first frame is speech, which is 50% of the length of the smaller second frame. Thus, a speech-favouring switching decision already switches over to speech even when, for example, only 6% or 12% of a block corresponding to the frame length of the first switch is speech.
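The percentages follow directly from the two frame lengths, as the following sketch shows. The function names are hypothetical, and treating "50% of the second frame" as a greater-than-or-equal threshold is an assumption of this sketch.

```python
def speech_threshold_samples(second_frame_len):
    # Speech is favoured once it covers half of the smaller (second-switch) frame.
    return second_frame_len // 2

def favours_speech(speech_samples, second_frame_len):
    """Assumed decision rule: switch to speech at or above the threshold."""
    return speech_samples >= speech_threshold_samples(second_frame_len)

def threshold_fraction(first_frame_len, second_frame_len):
    """Fraction of the large first frame that has to be speech."""
    return speech_threshold_samples(second_frame_len) / first_frame_len
```

With a first frame of 1024 samples, a second frame of 128 samples gives a fraction of 0.0625 and a second frame of 256 samples gives 0.125, which matches the approximate 6% and 12% figures above.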

This procedure is advantageous in order to fully exploit the bit rate saving capability of the first processing branch, which has a voiced speech core in one embodiment, and to not lose any quality even for the rest of the large first frame, which is non-speech, due to the fact that the second processing branch includes a converter and, therefore, is useful for audio signals which have non-speech signals as well. Advantageously, this second processing branch includes an overlapping MDCT, which is critically sampled, and which even at small window sizes provides a highly efficient and aliasing free operation due to the time domain aliasing cancellation processing such as overlap and add on the decoder-side. Furthermore, a large block length for the first encoding branch, which is advantageously an AAC-like MDCT encoding branch, is useful, since non-speech signals are normally quite stationary and a long transform window provides a high frequency resolution and, therefore, high quality and, additionally, provides a bit rate efficiency due to a psychoacoustically controlled quantization module, which can also be applied to the transform based coding mode in the second processing branch of the second coding branch.

Regarding the FIG. 3 d decoder illustration, it is advantageous that the transmitted signal includes an explicit indicator as side information 4 a as illustrated in FIG. 3 e. This side information 4 a is extracted by a bit stream parser not illustrated in FIG. 3 d in order to forward the corresponding first encoded signal, first processed signal or second processed signal to the correct processor such as the first decoding branch, the first inverse processing branch or the second inverse processing branch in FIG. 3 d. Therefore, an encoded signal not only has the encoded/processed signals but also includes side information relating to these signals. In other embodiments, however, there can be an implicit signaling which allows a decoder-side bit stream parser to distinguish between the certain signals. Regarding FIG. 3 e, it is outlined that the first processed signal or the second processed signal is the output of the second coding branch and, therefore, the second coded signal.

Advantageously, the first decoding branch and/or the second inverse processing branch includes an MDCT transform for converting from the spectral domain to the time domain. To this end, an overlap-adder is provided to perform a time domain aliasing cancellation functionality which, at the same time, provides a cross fade effect in order to avoid blocking artifacts. Generally, the first decoding branch converts a signal encoded in the fourth domain into the first domain, while the second inverse processing branch performs a conversion from the third domain to the second domain and the converter subsequently connected to the first combiner provides a conversion from the second domain to the first domain so that, at the input of the combiner 600, only first domain signals are there, which represent, in the FIG. 3 d embodiment, the decoded output signal.

FIGS. 4 a and 4 b illustrate two different embodiments, which differ in the positioning of the switch 200. In FIG. 4 a, the switch 200 is positioned between an output of the common pre-processing stage 100 and the input of the two encoding branches 400, 500. The FIG. 4 a embodiment makes sure that the audio signal is input into a single encoding branch only, and the other encoding branch, which is not connected to the output of the common pre-processing stage, does not operate and, therefore, is switched off or is in a sleep mode. This embodiment is advantageous in that the non-active encoding branch does not consume power and computational resources, which is useful for mobile applications in particular, which are battery-powered and, therefore, have the general limitation of power consumption.

On the other hand, however, the FIG. 4 b embodiment may be advantageous when power consumption is not an issue. In this embodiment, both encoding branches 400, 500 are active all the time, and only the output of the selected encoding branch for a certain time portion and/or a certain frequency portion is forwarded to the bit stream formatter which may be implemented as a bit stream multiplexer 800. Therefore, in the FIG. 4 b embodiment, both encoding branches are active all the time, and the output of an encoding branch which is selected by the decision stage 300 is entered into the output bit stream, while the output of the other non-selected encoding branch 400 is discarded, i.e., not entered into the output bit stream, i.e., the encoded audio signal.

Advantageously, the second encoding rule/decoding rule is an LPC-based coding algorithm. In LPC-based speech coding, a differentiation between quasi-periodic impulse-like excitation signal segments or signal portions, and noise-like excitation signal segments or signal portions, is made. This is performed for very low bit rate LPC vocoders (2.4 kbps) as in FIG. 7 b. However, in medium rate CELP coders, the excitation is obtained by the addition of scaled vectors from an adaptive codebook and a fixed codebook.

Quasi-periodic impulse-like excitation signal segments, i.e., signal segments having a specific pitch, are coded with different mechanisms than noise-like excitation signals. While quasi-periodic impulse-like excitation signals are connected to voiced speech, noise-like signals are related to unvoiced speech.

Exemplarily, reference is made to FIGS. 5 a to 5 d. Here, quasi-periodic impulse-like signal segments or signal portions and noise-like signal segments or signal portions are exemplarily discussed. Specifically, voiced speech as illustrated in FIG. 5 a in the time domain and in FIG. 5 b in the frequency domain is discussed as an example for a quasi-periodic impulse-like signal portion, and an unvoiced speech segment as an example for a noise-like signal portion is discussed in connection with FIGS. 5 c and 5 d. Speech can generally be classified as voiced, unvoiced, or mixed. Time-and-frequency domain plots for sampled voiced and unvoiced segments are shown in FIGS. 5 a to 5 d. Voiced speech is quasi-periodic in the time domain and harmonically structured in the frequency domain, while unvoiced speech is random-like and broadband. The short-time spectrum of voiced speech is characterized by its fine harmonic formant structure. The fine harmonic structure is a consequence of the quasi-periodicity of speech and may be attributed to the vibrating vocal cords. The formant structure (spectral envelope) is due to the interaction of the source and the vocal tract. The vocal tract consists of the pharynx and the mouth cavity. The shape of the spectral envelope that “fits” the short time spectrum of voiced speech is associated with the transfer characteristics of the vocal tract and the spectral tilt (6 dB/Octave) due to the glottal pulse. The spectral envelope is characterized by a set of peaks which are called formants. The formants are the resonant modes of the vocal tract. For the average vocal tract there are three to five formants below 5 kHz. The amplitudes and locations of the first three formants, usually occurring below 3 kHz, are quite important in both speech synthesis and perception. Higher formants are also important for wide band and unvoiced speech representations. The properties of speech are related to the physical speech production system as follows.

Voiced speech is produced by exciting the vocal tract with quasi-periodic glottal air pulses generated by the vibrating vocal cords. The frequency of the periodic pulses is referred to as the fundamental frequency or pitch.

Unvoiced speech is produced by forcing air through a constriction in the vocal tract. Nasal sounds are due to the acoustic coupling of the nasal tract to the vocal tract, and plosive sounds are produced by abruptly releasing the air pressure which was built up behind the closure in the tract.

Thus, a noise-like portion of the audio signal shows neither any impulse-like time-domain structure nor harmonic frequency-domain structure as illustrated in FIG. 5 c and in FIG. 5 d, which is different from the quasi-periodic impulse-like portion as illustrated for example in FIG. 5 a and in FIG. 5 b. As will be outlined later on, however, the differentiation between noise-like portions and quasi-periodic impulse-like portions can also be observed after an LPC for the excitation signal. The LPC is a method which models the vocal tract and extracts from the signal the excitation of the vocal tract.

Furthermore, quasi-periodic impulse-like portions and noise-like portions can occur successively in time, which means that one portion of the audio signal in time is noisy and another portion of the audio signal in time is quasi-periodic, i.e., tonal. Alternatively, or additionally, the characteristic of a signal can be different in different frequency bands. Thus, the determination, whether the audio signal is noisy or tonal, can also be performed in a frequency-selective way so that a certain frequency band or several certain frequency bands are considered to be noisy and other frequency bands are considered to be tonal. In this case, a certain time portion of the audio signal might include tonal components and noisy components.

FIG. 7 a illustrates a linear model of a speech production system. This system assumes a two-stage excitation, i.e., an impulse-train for voiced speech as indicated in FIG. 7 c, and a random-noise for unvoiced speech as indicated in FIG. 7 d. The vocal tract is modelled as an all-pole filter 70 which processes pulses of FIG. 7 c or FIG. 7 d, generated by the glottal model 72. Hence, the system of FIG. 7 a can be reduced to an all-pole filter model of FIG. 7 b having a gain stage 77, a forward path 78, a feedback path 79, and an adding stage 80. In the feedback path 79, there is a prediction filter 81, and the whole source-model synthesis system illustrated in FIG. 7 b can be represented using z-domain functions as follows:

S(z) = g / (1 − A(z)) · X(z),

where g represents the gain, A(z) is the prediction filter as determined by an LP analysis, X(z) is the excitation signal, and S(z) is the synthesis speech output.
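In the time domain, the all-pole model corresponds to a simple recursion, sketched below under the assumption that A(z) is the sum of a_k·z^(−k) for k = 1 to p, so that s(n) = g·x(n) + Σ a_k·s(n−k). The function name `lpc_synthesis` and the list-based coefficients are illustrative.

```python
def lpc_synthesis(excitation, a, gain=1.0):
    """All-pole synthesis: s[n] = gain * x[n] + sum_k a[k-1] * s[n-k].

    `a` holds the prediction coefficients a_1..a_p of A(z) (assumed convention).
    Samples before the start of the block are taken as zero.
    """
    s = []
    for n, x in enumerate(excitation):
        acc = gain * x                       # forward path with gain stage
        for k, ak in enumerate(a, start=1):  # feedback path through the prediction filter
            if n - k >= 0:
                acc += ak * s[n - k]
        s.append(acc)
    return s

# A single excitation pulse produces a decaying response for a stable filter.
response = lpc_synthesis([1.0, 0.0, 0.0, 0.0], a=[0.5], gain=2.0)
```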

FIGS. 7 c and 7 d give a graphical time domain description of voiced and unvoiced speech synthesis using the linear source system model. This system and the excitation parameters in the above equation are unknown and have to be determined from a finite set of speech samples. The coefficients of A(z) are obtained using a linear prediction of the input signal and a quantization of the filter coefficients. In a p-th order forward linear predictor, the present sample of the speech sequence is predicted from a linear combination of p past samples. The predictor coefficients can be determined by well-known algorithms such as the Levinson-Durbin algorithm, or generally an autocorrelation method or a reflection method.
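As a minimal sketch of the autocorrelation method, the Levinson-Durbin recursion can be written as follows. The predictor convention (the present sample is approximated by Σ a_k·s(n−k)) and the function names are assumptions for illustration; windowing and coefficient quantization are left out.

```python
def autocorrelation(signal, order):
    """Biased autocorrelation values r[0..order] of one block."""
    return [sum(signal[n] * signal[n - lag] for n in range(lag, len(signal)))
            for lag in range(order + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for a_1..a_p from autocorrelation r[0..p].

    Returns (coefficients, final prediction error energy).
    """
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break                    # signal is perfectly predictable or silent
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                # reflection coefficient of stage i
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= (1.0 - k * k)
    return a[1:], err
```

For an autocorrelation sequence of the form r[k] = 0.5^k, the recursion returns a first coefficient of 0.5 and a second coefficient of zero, as expected for a first-order process.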

FIG. 7 e illustrates a more detailed implementation of the LPC analysis block 510. The audio signal is input into a filter determination block which determines the filter information A(z). This information is output as the short-term prediction information needed for a decoder. The short-term prediction information is needed by the actual prediction filter 85. In a subtracter 86, a current sample of the audio signal is input and a predicted value for the current sample is subtracted so that for this sample, the prediction error signal is generated at line 84. A sequence of such prediction error signal samples is very schematically illustrated in FIG. 7 c or 7 d. Therefore, FIG. 7 a, 7 b can be considered as a kind of a rectified impulse-like signal.
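The subtraction performed per sample can be sketched in a few lines. The function name `prediction_error` and the coefficient convention (predicted value equals Σ a_k·s(n−k), with samples before the block taken as zero) are assumptions of this sketch.

```python
def prediction_error(signal, a):
    """Prediction error e[n] = s[n] - sum_k a[k-1] * s[n-k] for one block."""
    error = []
    for n, current in enumerate(signal):
        # Predicted value from the p past samples (prediction filter).
        predicted = sum(ak * signal[n - k]
                        for k, ak in enumerate(a, start=1) if n - k >= 0)
        # Subtracter: current sample minus its prediction.
        error.append(current - predicted)
    return error

# A decaying exponential is fully predicted by a_1 = 0.5 after its first sample,
# so the prediction error collapses to a single pulse.
residual = prediction_error([1.0, 0.5, 0.25, 0.125], a=[0.5])
```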

While FIG. 7 e illustrates a way to calculate the excitation signal, FIG. 7 f illustrates a way to calculate the weighted signal. In contrast to FIG. 7 e, the filter 85 is different, when γ is different from 1. A value smaller than 1 is advantageous for γ. Furthermore, the block 87 is present, and μ is advantageously a number smaller than 1. Generally, the elements in FIGS. 7 e and 7 f can be implemented as in 3GPP TS 26.190 or 3GPP TS 26.290.

FIG. 7 g illustrates an inverse processing, which can be applied on the decoder side such as in element 537 of FIG. 2 b. Particularly, block 88 generates an unweighted signal from the weighted signal and block 89 calculates an excitation from the unweighted signal. Generally, all signals but the unweighted signal in FIG. 7 g are in the LPC domain, but the excitation signal and the weighted signal are different signals in the same domain. Block 89 outputs an excitation signal which can then be used together with the output of block 536. Then, the common inverse LPC transform can be performed in block 540 of FIG. 2 b.

Subsequently, an analysis-by-synthesis CELP encoder will be discussed in connection with FIG. 6 in order to illustrate the modifications applied to this algorithm. This CELP encoder is discussed in detail in “Speech Coding: A Tutorial Review”, Andreas Spanias, Proceedings of the IEEE, Vol. 82, No. 10, October 1994, pages 1541-1582. The CELP encoder as illustrated in FIG. 6 includes a long-term prediction component 60 and a short-term prediction component 62. Furthermore, a codebook is used which is indicated at 64. A perceptual weighting filter W(z) is implemented at 66, and an error minimization controller is provided at 68. s(n) is the time-domain input signal. After having been perceptually weighted, the weighted signal is input into a subtracter 69, which calculates the error between the weighted synthesis signal at the output of block 66 and the original weighted signal s_w(n). Generally, the short-term prediction filter coefficients A(z) are calculated by an LP analysis stage and its coefficients are quantized in Â(z) as indicated in FIG. 7 e. The long-term prediction information A_L(z) including the long-term prediction gain g and the vector quantization index, i.e., codebook references, are calculated on the prediction error signal at the output of the LPC analysis stage referred to as 10 a in FIG. 7 e. The LTP parameters are the pitch delay and gain. In CELP this is usually implemented as an adaptive codebook containing the past excitation signal (not the residual). The adaptive CB delay and gain are found by minimizing the mean-squared weighted error (closed-loop pitch search).

The CELP algorithm then encodes the residual signal obtained after the short-term and long-term predictions using a codebook of, for example, Gaussian sequences. The ACELP algorithm, where the “A” stands for “Algebraic”, has a specific algebraically designed codebook.

A codebook may contain more or fewer vectors, where each vector is some samples long. A gain factor g scales the code vector and the gained code is filtered by the long-term prediction synthesis filter and the short-term prediction synthesis filter. The “optimum” code vector is selected such that the perceptually weighted mean square error at the output of the subtracter 69 is minimized. The search process in CELP is done by an analysis-by-synthesis optimization as illustrated in FIG. 6.
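The analysis-by-synthesis search can be illustrated with a strongly simplified sketch. It filters every code vector through a short-term synthesis filter only and minimizes a plain squared error; the long-term prediction filter and the perceptual weighting of FIG. 6 are deliberately omitted, and all names and the tiny codebook are hypothetical.

```python
def synthesize(excitation, a):
    # Short-term all-pole synthesis: y[n] = x[n] + sum_k a[k-1] * y[n-k].
    y = []
    for n, x in enumerate(excitation):
        y.append(x + sum(ak * y[n - k] for k, ak in enumerate(a, start=1) if n - k >= 0))
    return y

def search_codebook(target, codebook, a):
    """Return (index, gain, error) of the code vector whose scaled synthesis
    output is closest to `target` in the squared-error sense."""
    best = None
    for index, vector in enumerate(codebook):
        y = synthesize(vector, a)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue                                   # an all-zero vector cannot be scaled
        gain = sum(t * v for t, v in zip(target, y)) / energy   # optimum gain for this vector
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best

codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, -1.0, 0.0, 0.0]]
target = [2.0 * v for v in synthesize(codebook[1], a=[0.5])]
index, gain, error = search_codebook(target, codebook, a=[0.5])
```

Computing the optimum gain per vector in closed form keeps the search to one synthesis pass per codebook entry; practical coders use further shortcuts that are outside the scope of this sketch.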

For specific cases, when a frame is a mixture of unvoiced and voiced speech or when speech over music occurs, a TCX coding can be more appropriate to code the excitation in the LPC domain. The TCX coding processes the weighted signal in the frequency domain without making any assumption of excitation production. The TCX is thus more generic than CELP coding and is not restricted to a voiced or a non-voiced source model of the excitation. TCX is still a source-oriented model coding using a linear predictive filter for modelling the formants of the speech-like signals.

In the AMR-WB+-like coding, a selection between different TCX modes and ACELP takes place as known from the AMR-WB+ description. The TCX modes are different in that the length of the block-wise Discrete Fourier Transform is different for different modes and the best mode can be selected by an analysis by synthesis approach or by a direct “feedforward” mode.

As discussed in connection with FIGS. 2 a and 2 b, the common pre-processing stage 100 advantageously includes a joint multi-channel (surround/joint stereo device) 101 and, additionally, a band width extension stage 102. Correspondingly, the decoder includes a band width extension stage 701 and a subsequently connected joint multichannel stage 702. Advantageously, the joint multichannel stage 101 is, with respect to the encoder, connected before the band width extension stage 102, and, on the decoder side, the band width extension stage 701 is connected before the joint multichannel stage 702 with respect to the signal processing direction. Alternatively, however, the common pre-processing stage can include a joint multichannel stage without the subsequently connected bandwidth extension stage or a bandwidth extension stage without a connected joint multichannel stage.

An example for a joint multichannel stage on the encoder side 101 a, 101 b and on the decoder side 702 a and 702 b is illustrated in the context of FIG. 8. A number of E original input channels is input into the downmixer 101 a so that the downmixer generates a number of K transmitted channels, where the number K is greater than or equal to one and is smaller than or equal to E.

Advantageously, the E input channels are input into a joint multichannel parameter analyzer 101 b which generates parametric information. This parametric information is advantageously entropy-encoded such as by a difference encoding and subsequent Huffman encoding or, alternatively, subsequent arithmetic encoding. The encoded parametric information output by block 101 b is transmitted to a parameter decoder 702 b which may be part of item 702 in FIG. 2 b. The parameter decoder 702 b decodes the transmitted parametric information and forwards the decoded parametric information into the upmixer 702 a. The upmixer 702 a receives the K transmitted channels and generates a number of L output channels, where the number of L is greater than or equal to K and lower than or equal to E.

Parametric information may include inter channel level differences, inter channel time differences, inter channel phase differences and/or inter channel coherence measures as is known from the BCC technique or as is known and is described in detail in the MPEG surround standard. The number of transmitted channels may be a single mono channel for ultra-low bit rate applications or may include a compatible stereo application or may include a compatible stereo signal, i.e., two channels. Typically, the number of E input channels may be five or may be even higher. Alternatively, the number of E input channels may also be E audio objects as it is known in the context of spatial audio object coding (SAOC).

In one implementation, the downmixer performs a weighted or unweighted addition of the original E input channels or an addition of the E input audio objects. In case of audio objects as input channels, the joint multichannel parameter analyzer 101 b will calculate audio object parameters such as a correlation matrix between the audio objects, advantageously for each time portion and even more advantageously for each frequency band. To this end, the whole frequency range may be divided into at least 10 and advantageously 32 or 64 frequency bands.
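A weighted addition of E input channels into K transmitted channels can be sketched as a small matrix operation. The function name `downmix`, the row-per-output weight layout and the example weights are assumptions for illustration only.

```python
def downmix(channels, weights):
    """Mix E input channels into K transmitted channels.

    `channels` is a list of E equally long sample lists; `weights` is a list
    of K rows with E weights each (all weights equal gives an unweighted sum).
    """
    num_in = len(channels)
    num_out = len(weights)
    if not 1 <= num_out <= num_in:
        raise ValueError("K has to satisfy 1 <= K <= E")
    if any(len(row) != num_in for row in weights):
        raise ValueError("each weight row needs one weight per input channel")
    length = len(channels[0])
    return [[sum(row[e] * channels[e][n] for e in range(num_in)) for n in range(length)]
            for row in weights]

# Two input channels mixed to a single mono channel with equal weights.
mono = downmix([[1.0, 2.0], [3.0, 4.0]], weights=[[0.5, 0.5]])
```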

FIG. 9 illustrates an embodiment for the implementation of the bandwidth extension stage 102 in FIG. 2 a and the corresponding band width extension stage 701 in FIG. 2 b. On the encoder-side, the bandwidth extension block 102 advantageously includes a low pass filtering block 102 b, a downsampler block, which follows the lowpass, or which is part of the inverse QMF, which acts on only half of the QMF bands, and a high band analyzer 102 a. The original audio signal input into the bandwidth extension block 102 is low-pass filtered to generate the low band signal which is then input into the encoding branches and/or the switch. The low pass filter has a cut off frequency which can be in a range of 3 kHz to 10 kHz. Furthermore, the bandwidth extension block 102 includes a high band analyzer for calculating the bandwidth extension parameters such as a spectral envelope parameter information, a noise floor parameter information, an inverse filtering parameter information, further parametric information relating to certain harmonic lines in the high band and additional parameters as discussed in detail in the MPEG-4 standard in the chapter related to spectral band replication.

On the decoder-side, the bandwidth extension block 701 includes a patcher 701 a, an adjuster 701 b and a combiner 701 c. The combiner 701 c combines the decoded low band signal and the reconstructed and adjusted high band signal output by the adjuster 701 b. The input into the adjuster 701 b is provided by a patcher which is operated to derive the high band signal from the low band signal such as by spectral band replication or, generally, by bandwidth extension. The patching performed by the patcher 701 a may be a patching performed in a harmonic way or in a non-harmonic way. The signal generated by the patcher 701 a is, subsequently, adjusted by the adjuster 701 b using the transmitted parametric bandwidth extension information.

As indicated in FIG. 8 and FIG. 9, the described blocks may have a mode control input in an embodiment. This mode control input is derived from the decision stage 300 output signal. In such an embodiment, a characteristic of a corresponding block may be adapted to the decision stage output, i.e., whether, in an embodiment, a decision to speech or a decision to music is made for a certain time portion of the audio signal. Advantageously, the mode control only relates to one or more of the functionalities of these blocks but not to all of the functionalities of the blocks. For example, the decision may influence only the patcher 701 a but may not influence the other blocks in FIG. 9, or may, for example, influence only the joint multichannel parameter analyzer 101 b in FIG. 8 but not the other blocks in FIG. 8. This implementation is advantageous in that a higher flexibility, a higher quality and a lower bit rate output signal are obtained by providing flexibility in the common pre-processing stage. On the other hand, however, the usage of algorithms in the common pre-processing stage for both kinds of signals allows an efficient encoding/decoding scheme to be implemented.

FIG. 10 a and FIG. 10 b illustrate two different implementations of the decision stage 300. In FIG. 10 a, an open loop decision is indicated. Here, the signal analyzer 300 a in the decision stage has certain rules in order to decide whether a certain time portion or a certain frequency portion of the input signal has a characteristic which requests that this signal portion is encoded by the first encoding branch 400 or by the second encoding branch 500. To this end, the signal analyzer 300 a may analyze the audio input signal into the common pre-processing stage or may analyze the audio signal output by the common pre-processing stage, i.e., the audio intermediate signal, or may analyze an intermediate signal within the common pre-processing stage such as the output of the downmix signal, which may be a mono signal or which may be a signal having k channels indicated in FIG. 8. On the output-side, the signal analyzer 300 a generates the switching decision for controlling the switch 200 on the encoder-side and the corresponding switch 600 or the combiner 600 on the decoder-side.

Although not discussed in detail for the second switch 521, it is to be emphasized that the second switch 521 can be positioned in a similar way as the first switch 200 as discussed in connection with FIG. 4 a and FIG. 4 b. Thus, an alternative position of switch 521 in FIG. 3 c is at the output of both processing branches 522, 523, 524 so that both processing branches operate in parallel and only the output of one processing branch is written into a bit stream via a bit stream former which is not illustrated in FIG. 3 c.

Furthermore, the second combiner 600 may have a specific cross fading functionality as discussed in FIG. 4 c. Alternatively or additionally, the first combiner 532 might have the same cross fading functionality. Furthermore, both combiners may have the same cross fading functionality or may have different cross fading functionalities or may have no cross fading functionalities at all so that both combiners are switches without any additional cross fading functionality.

As discussed before, both switches can be controlled via an open loop decision or a closed loop decision as discussed in connection with FIG. 10 a and FIG. 10 b, where the controller 300, 525 of FIG. 3 c can have different or the same functionalities for both switches.

Furthermore, a time warping functionality which is signal-adaptive can exist not only in the first encoding branch or first decoding branch but can also exist in the second processing branch of the second coding branch on the encoder side as well as on the decoder side. Depending on a processed signal, both time warping functionalities can have the same time warping information so that the same time warp is applied to the signals in the first domain and in the second domain. This saves processing load and might be useful in cases where subsequent blocks have a similar time warping characteristic. In alternative embodiments, however, it is advantageous to have independent time warp estimators for the first coding branch and the second processing branch in the second coding branch.

The inventive encoded audio signal can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.

In a different embodiment, the switch 200 of FIG. 1 a or 2 a switches between the two coding branches 400, 500. In a further embodiment, there can be additional encoding branches such as a third encoding branch or even a fourth encoding branch or even more encoding branches. On the decoder side, the switch 600 of FIG. 1 b or 2 b switches between the two decoding branches 431, 440 and 531, 532, 533, 534, 540. In a further embodiment, there can be additional decoding branches such as a third decoding branch or even a fourth decoding branch or even more decoding branches. Similarly, the other switches 521 or 532 may switch between more than two different coding algorithms, when such additional coding/decoding branches are provided.

FIG. 12A illustrates an embodiment of an encoder implementation, and FIG. 12B illustrates an embodiment of the corresponding decoder implementation. In addition to the elements discussed before with respect to corresponding reference numbers, the embodiment of FIG. 12A illustrates a separate psychoacoustic module 1200, and additionally illustrates an implementation of the further encoder tools illustrated at block 421 in FIG. 11A. These additional tools are a temporal noise shaping (TNS) tool 1201 and a mid/side coding tool (M/S) 1202. Furthermore, additional functionalities of the elements 421 and 524 are illustrated in block 421/542 as a combined implementation of scaling, noise filling analysis, quantization and arithmetic coding of spectral values.

In the corresponding decoder implementation of FIG. 12B, additional elements are illustrated, which are an M/S decoding tool 1203 and a TNS-decoder tool 1204. Furthermore, a bass postfilter not illustrated in the preceding figures is indicated at 1205. The transition windowing block 532 corresponds to the element 532 in FIG. 2B, which is illustrated as a switch, but which performs a kind of cross fading which can either be an oversampled cross fading or a critically sampled cross fading. The latter is implemented as an MDCT operation, where two time-aliased portions are overlapped and added. This critically sampled transition processing is advantageously used where appropriate, since the overall bitrate can be reduced without any loss in quality. The additional transition windowing block 600 corresponds to the combiner 600 in FIG. 2B, which is again illustrated as a switch, but it is clear that this element performs a kind of cross fading, either critically sampled or non-critically sampled, in order to avoid blocking artifacts, and specifically switching artifacts, when one block has been processed in the first branch and the other block has been processed in the second branch. When, however, the processing in both branches is perfectly matched to each other, then the cross fading operation can “degrade” to a hard switch, while a cross fading operation is understood to be a “soft” switching between both branches.
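The difference between a “soft” cross fading and a hard switch can be sketched as follows. The sketch assumes a simple linear, non-critically sampled fade over an overlap region in which both branches deliver decoded samples; the function names and the linear fade shape are illustrative assumptions and do not describe the MDCT-based critically sampled transition.

```python
def cross_fade(tail_a, head_b):
    """Fade out the samples of branch A while fading in those of branch B.

    Both inputs cover the same overlap region and must be equally long.
    A linear fade is assumed here purely for illustration.
    """
    if len(tail_a) != len(head_b):
        raise ValueError("overlap regions must have the same length")
    n = len(tail_a)
    out = []
    for i in range(n):
        g = (i + 0.5) / n  # fade-in gain of branch B, rising from ~0 to ~1
        out.append((1.0 - g) * tail_a[i] + g * head_b[i])
    return out


def hard_switch(tail_a, head_b, switch_index):
    """Take branch A up to switch_index and branch B afterwards."""
    return list(tail_a[:switch_index]) + list(head_b[switch_index:])
```

When both branches deliver identical samples in the overlap region, the two functions return the same result, which corresponds to the case in which the cross fading degrades to a hard switch.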

The concept in FIGS. 12A and 12B permits coding of signals having an arbitrary mix of speech and audio content, and this concept performs comparably to or better than the best coding technology that might be tailored specifically to coding of either speech or general audio content. The general structure of the encoder and decoder can be described in that there is a common pre-post processing consisting of an MPEG Surround (MPEGS) functional unit to handle stereo or multi-channel processing and an enhanced SBR (eSBR) unit, which handles the parametric representation of the higher audio frequencies in the input signal. Then, there are two branches, one consisting of a modified advanced audio coding (AAC) tool path and the other consisting of a linear prediction coding (LP or LPC domain) based path, which in turn features either a frequency domain representation or a time domain representation of the LPC residual. All transmitted spectra for both AAC and LPC are represented in the MDCT domain following quantization and arithmetic coding. The time domain representation uses an ACELP excitation coding scheme. The basic structure is shown in FIG. 12A for the encoder and FIG. 12B for the decoder. The data flow in this diagram is from left to right, top to bottom. The functions of the decoder are to find the description of the quantized audio spectral or time domain representation in the bitstream payload and decode the quantized values and other reconstruction information.

In case of transmitted spectral information, the decoder shall reconstruct the quantized spectra, process the reconstructed spectra through whatever tools are active in the bitstream payload in order to arrive at the actual signal spectra as described by the input bitstream payload, and finally convert the frequency domain spectra to the time domain. Following the initial reconstruction and scaling of the spectrum reconstruction, there are optional tools that modify one or more of the spectra in order to provide more efficient coding.

In case of a transmitted time domain signal representation, the decoder shall reconstruct the quantized time signal and process the reconstructed time signal through whatever tools are active in the bitstream payload in order to arrive at the actual time domain signal as described by the input bitstream payload.

For each of the optional tools that operate on the signal data, the option to “pass through” is retained, and in all cases where the processing is omitted, the spectra or time samples at its input are passed directly through the tool without modification.

In places where the bitstream changes its signal representation from time domain to frequency domain representation or from LP domain to non-LP domain or vice versa, the decoder shall facilitate the transition from one domain to the other by means of an appropriate transition overlap-add windowing.

eSBR and MPEGS processing is applied in the same manner to both coding paths after transition handling.

The input to the bitstream payload demultiplexer tool is a bitstream payload. The demultiplexer separates the bitstream payload into the parts for each tool, and provides each of the tools with the bitstream payload information related to that tool.

The outputs from the bitstream payload demultiplexer tool are:

-   -   Depending on the core coding type in the current frame, either:
        -   the quantized and noiselessly coded spectra represented by
            -   scalefactor information
            -   arithmetically coded spectral lines
        -   or: linear prediction (LP) parameters together with an excitation signal represented by either:
            -   quantized and arithmetically coded spectral lines (transform coded excitation, TCX) or
            -   ACELP coded time domain excitation
    -   The spectral noise filling information (optional)
    -   The M/S decision information (optional)
    -   The temporal noise shaping (TNS) information (optional)
    -   The filterbank control information
    -   The time unwarping (TW) control information (optional)
    -   The enhanced spectral bandwidth replication (eSBR) control information
    -   The MPEG Surround (MPEGS) control information

The scalefactor noiseless decoding tool takes information from the bitstream payload demultiplexer, parses that information, and decodes the Huffman and DPCM coded scalefactors.

The input to the scalefactor noiseless decoding tool is:

-   -   The scalefactor information for the noiselessly coded spectra

The output of the scalefactor noiseless decoding tool is:

-   -   The decoded integer representation of the scalefactors

The spectral noiseless decoding tool takes information from the bitstream payload demultiplexer, parses that information, decodes the arithmetically coded data, and reconstructs the quantized spectra. The input to this noiseless decoding tool is:

-   -   The noiselessly coded spectra

The output of this noiseless decoding tool is:

-   -   The quantized values of the spectra

The inverse quantizer tool takes the quantized values for the spectra and converts the integer values to the non-scaled, reconstructed spectra. This quantizer is a companding quantizer, whose companding factor depends on the chosen core coding mode.

The input to the Inverse Quantizer tool is:

-   -   The quantized values for the spectra

The output of the inverse quantizer tool is:

-   -   The un-scaled, inversely quantized spectra

The noise filling tool is used to fill spectral gaps in the decoded spectra, which occur when spectral values are quantized to zero, e.g. due to a strong restriction on bit demand in the encoder. The use of the noise filling tool is optional.

The inputs to the noise filling tool are:

-   -   The un-scaled, inversely quantized spectra
    -   Noise filling parameters
    -   The decoded integer representation of the scalefactors

The outputs of the noise filling tool are:

-   -   The un-scaled, inversely quantized spectral values for spectral lines which were previously quantized to zero.
    -   Modified integer representation of the scalefactors
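The basic idea of filling spectral gaps can be sketched as follows. This is a simplified illustration under stated assumptions: a single noise level for all zeroed lines, random signs drawn from a seeded generator, and no modification of the scalefactors. The function name `noise_fill` and its parameters are hypothetical and do not reflect the actual noise filling parameters of the bitstream.

```python
import random


def noise_fill(spectrum, noise_level, seed=0):
    """Replace spectral lines quantized to zero by noise of a given level.

    Non-zero lines are passed through unchanged. The constant level and
    the seeded random sign are assumptions made for this illustration.
    """
    rng = random.Random(seed)
    return [x if x != 0 else rng.choice((-1.0, 1.0)) * noise_level
            for x in spectrum]
```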

The rescaling tool converts the integer representation of the scalefactors to the actual values, and multiplies the un-scaled, inversely quantized spectra by the relevant scalefactors.

The inputs to the scalefactors tool are:

-   -   The decoded integer representation of the scalefactors
    -   The un-scaled, inversely quantized spectra

The output from the scalefactors tool is:

-   -   The scaled, inversely quantized spectra
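The chain of inverse quantization and rescaling can be sketched as follows. The numerical conventions used here are assumptions borrowed from the well-known AAC scheme, namely a companding exponent of 4/3 and a scalefactor gain of 2^(0.25·(sf − 100)); the present text only states that the companding factor depends on the core coding mode. The function names and the band offset layout are likewise illustrative.

```python
SF_OFFSET = 100  # assumed scalefactor offset (AAC convention)


def inverse_quantize(q, exponent=4.0 / 3.0):
    """Expand one quantized integer value; the exponent is mode dependent."""
    sign = -1.0 if q < 0 else 1.0
    return sign * abs(q) ** exponent


def rescale(spectrum, scalefactors, band_offsets):
    """Multiply each scalefactor band by the gain derived from its integer
    scalefactor. band_offsets holds the start of each band plus the end
    of the last band."""
    out = list(spectrum)
    for band, sf in enumerate(scalefactors):
        gain = 2.0 ** (0.25 * (sf - SF_OFFSET))
        for k in range(band_offsets[band], band_offsets[band + 1]):
            out[k] = spectrum[k] * gain
    return out
```

In this sketch, a scalefactor step of four doubles the amplitude of the corresponding band.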

For an overview of the M/S tool, please refer to ISO/IEC 14496-3, subpart 4.1.1.2.

For an overview of the temporal noise shaping (TNS) tool, please refer to ISO/IEC 14496-3, subpart 4.1.1.2.

The filterbank/block switching tool applies the inverse of the frequency mapping that was carried out in the encoder. An inverse modified discrete cosine transform (IMDCT) is used for the filterbank tool. The IMDCT can be configured to support 120, 128, 240, 256, 320, 480, 512, 576, 960, 1024 or 1152 spectral coefficients.

The inputs to the filterbank tool are:

-   -   The (inversely quantized) spectra
    -   The filterbank control information

The output(s) from the filterbank tool is (are):

-   -   The time domain reconstructed audio signal(s).

The time-warped filterbank/block switching tool replaces the normal filterbank/block switching tool when the time warping mode is enabled. The filterbank is the same (IMDCT) as for the normal filterbank. Additionally, the windowed time domain samples are mapped from the warped time domain to the linear time domain by time-varying resampling.

The inputs to the time-warped filterbank tools are:

-   -   The inversely quantized spectra
    -   The filterbank control information
    -   The time-warping control information

The output(s) from the filterbank tool is (are):

-   -   The linear time domain reconstructed audio signal(s).

The enhanced SBR (eSBR) tool regenerates the highband of the audio signal. It is based on replication of the sequences of harmonics truncated during encoding. It adjusts the spectral envelope of the generated highband, applies inverse filtering, and adds noise and sinusoidal components in order to recreate the spectral characteristics of the original signal.

The input to the eSBR tool is:

-   -   The quantized envelope data
    -   Misc. control data
    -   a time domain signal from the AAC core decoder

The output of the eSBR tool is either:

-   -   a time domain signal or
    -   a QMF-domain representation of a signal, e.g. in case the MPEG Surround tool is used.

The MPEG Surround (MPEGS) tool produces multiple signals from one or more input signals by applying a sophisticated upmix procedure to the input signal(s), controlled by appropriate spatial parameters. In the USAC context, MPEGS is used for coding a multichannel signal by transmitting parametric side information alongside a transmitted downmixed signal.

The input to the MPEGS tool is:

-   -   a downmixed time domain signal or
    -   a QMF-domain representation of a downmixed signal from the eSBR tool

The output of the MPEGS tool is:

-   -   a multi-channel time domain signal

The Signal Classifier tool analyses the original input signal and generates from it control information which triggers the selection of the different coding modes. The analysis of the input signal is implementation dependent and will try to choose the optimal core coding mode for a given input signal frame. The output of the signal classifier can (optionally) also be used to influence the behaviour of other tools, for example MPEG Surround, enhanced SBR, time-warped filterbank and others.

The input to the Signal Classifier tool is:

-   -   the original unmodified input signal
    -   additional implementation dependent parameters

The output of the Signal Classifier tool is:

-   -   a control signal to control the selection of the core codec (non-LP filtered frequency domain coding, LP filtered frequency domain coding or LP filtered time domain coding)

In accordance with the present invention, the time/frequency resolution in block 410 in FIG. 12A and in the converter 523 in FIG. 12A is controlled dependent on the audio signal.

The interrelation between window length, transform length, time resolution and frequency resolution is illustrated in FIG. 13A, where it becomes clear that, for a long window length, the time resolution is low but the frequency resolution is high, and for a short window length, the time resolution is high but the frequency resolution is low.
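The trade-off can be made concrete with a small calculation. The sketch below assumes a 50 percent overlap, so that the transform length is half the window length, and a sampling rate of 48 kHz chosen only as an example; the function name `resolution` and the returned keys are illustrative.

```python
def resolution(window_length, sampling_rate_hz):
    """Return transform length, time span and frequency spacing of a window.

    Assumes 50 percent overlap: transform length = window length / 2.
    """
    transform_length = window_length // 2
    # Duration covered by one window (coarser time resolution when larger).
    time_span_ms = 1000.0 * window_length / sampling_rate_hz
    # Spacing of the spectral coefficients between 0 Hz and fs/2
    # (finer frequency resolution when smaller).
    bin_spacing_hz = (sampling_rate_hz / 2.0) / transform_length
    return {
        "transform_length": transform_length,
        "time_span_ms": time_span_ms,
        "bin_spacing_hz": bin_spacing_hz,
    }


for n in (256, 2048, 2304):
    print(n, resolution(n, 48000))
```

Under these assumptions, a window of 2,048 samples spans about 42.7 ms with a coefficient spacing of about 23.4 Hz, whereas a window of 256 samples spans about 5.3 ms with a spacing of 187.5 Hz.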

In the first encoding branch, which is advantageously the AAC encoding branch indicated by elements 410, 1201, 1202, 4021 of FIG. 12A, different windows can be used, where the window shape is determined by a signal analyzer which is advantageously included in the signal classifier block 300, but which can also be a separate module. The encoder selects one of the windows illustrated in FIG. 13B, which have different time/frequency resolutions. The first long window, the second window, the fourth window, the fifth window and the sixth window each have a length of 2,048 sampling values, which corresponds to a transform length of 1,024. The short window illustrated in the third line in FIG. 13B has a time resolution of 256 sampling values corresponding to the window size. This corresponds to a transform length of 128.

Analogously, the last two windows have a window length equal to 2,304, which provides a better frequency resolution than the window in the first line but a lower time resolution. The transform length of the windows in the last two lines is equal to 1,152.

In the first encoding branch, different window sequences which are built from the transform windows in FIG. 13B can be constructed. Although in FIG. 13C only a short sequence is illustrated, while the other “sequences” consist of a single window only, larger sequences consisting of more windows can also be constructed. It is noted that according to FIG. 13B, for the smaller number of coefficients, i.e., 960 instead of 1,024, the time resolution is also lower than for the corresponding higher number of coefficients such as 1,024.

FIGS. 14A-14G illustrate different resolutions/window sizes in the second encoding branch. In an embodiment of the present invention, the second encoding branch has a first processing branch which is an ACELP time domain coder 526, and the second processing branch comprises the filterbank 523. In this branch, a super frame of, for example, 2048 samples is sub-divided into frames of 256 samples. Individual frames of 256 samples can be separately used so that a sequence of four windows, each window covering two frames, can be applied when an MDCT with 50 percent overlap is applied. Then, a high time resolution is used as illustrated in FIG. 14D. Alternatively, when the signal allows longer windows, the sequence as in FIG. 14C can be applied, where a double window size having 1,024 samples for each window (medium windows) is applied, so that one window covers four frames and there is an overlap of 50 percent.

Finally, when the signal is such that a long window can be used, this long window extends over 4,096 samples, again with a 50 percent overlap.

In the embodiment in which there are two branches, where one branch has an ACELP encoder, the position of the ACELP frame indicated by “A” in the super frame also may determine the window size applied for two adjacent TCX frames indicated by “T” in FIG. 14E. Basically, one is interested in using long windows whenever possible. Nevertheless, short windows have to be applied when a single T frame is between two A frames. Medium windows can be applied when there are two adjacent T frames. However, when there are three adjacent T frames, a correspondingly larger window might not be efficient due to the additional complexity. Therefore, the third T frame, although not preceded by an A frame, can be processed by a short window. When the whole super frame only has T frames, then a long window can be applied.
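One possible reading of these rules can be sketched as follows for a super frame of four frames. The sketch is an interpretation only: the function name `window_plan`, the string encoding of the frame pattern and the treatment of a single T frame at the border of the super frame as “short” are assumptions, and the sketch does not enforce any alignment restrictions that the signalling of the coding modes may impose on medium windows.

```python
def window_plan(pattern):
    """Map a pattern of four frames ('A' = ACELP, 'T' = TCX) to window sizes."""
    if len(pattern) != 4 or set(pattern) - set("AT"):
        raise ValueError("pattern must consist of four characters 'A' or 'T'")
    if pattern == "TTTT":
        return ["long"] * 4  # the whole super frame only has T frames
    plan = []
    i = 0
    while i < 4:
        if pattern[i] == "A":
            plan.append("acelp")
            i += 1
            continue
        j = i
        while j < 4 and pattern[j] == "T":
            j += 1
        run = j - i  # number of adjacent T frames (1, 2 or 3 here)
        if run >= 2:
            # Two adjacent T frames share a medium window; a third
            # adjacent T frame is processed by a short window.
            plan += ["medium", "medium"] + ["short"] * (run - 2)
        else:
            plan.append("short")
        i = j
    return plan
```

For example, the pattern "TTTA" yields two frames under a medium window, one frame under a short window and one ACELP frame.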

FIG. 14F illustrates several alternatives for windows, where the window size is 2× the number lg of spectral coefficients due to 50 percent overlap. However, other overlap percentages for all encoding branches can be applied so that the relation between window size and transform length can also be different from two and even approach one, when no time domain aliasing is applied.

FIG. 14G illustrates rules for constructing a window based on rules given in FIG. 14F. The value ZL illustrates zeroes at the beginning of the window. The value L illustrates a number of window coefficients in an aliasing zone. The values in portion M are “1” values not introducing any aliasing due to an overlap with an adjacent window which has zero values in the portion corresponding to M. The portion M is followed by a right overlap zone R, which is followed by a ZR zone of zeros, which would correspond to a portion M of a subsequent window.
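The construction of a window from the five portions ZL, L, M, R and ZR can be sketched as follows. The sine-shaped slopes in the overlap zones are an assumption made for the illustration, since the slope shape is not specified in this passage, and the numerical values used in the example are arbitrary rather than taken from FIG. 14F.

```python
import math


def build_window(zl, l, m, r, zr):
    """Concatenate leading zeros, a rising slope, a flat part of ones,
    a falling slope and trailing zeros. Sine/cosine slopes are assumed."""
    left = [math.sin(math.pi / 2 * (n + 0.5) / l) for n in range(l)]
    right = [math.cos(math.pi / 2 * (n + 0.5) / r) for n in range(r)]
    return [0.0] * zl + left + [1.0] * m + right + [0.0] * zr


# Example with arbitrary portion lengths.
w = build_window(zl=4, l=8, m=16, r=8, zr=4)
print(len(w))
```

With equally long slopes, the rising slope of one window and the falling slope of the preceding window are power complementary in this sketch, which is the property needed for time domain aliasing cancellation in the overlap zone.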

Reference is made to the subsequently attached annex, which describes an advantageous and detailed implementation of an inventive audio encoding/decoding scheme, particularly with respect to the decoder-side.

Annex

1. Windows and Window Sequences

Quantization and coding is done in the frequency domain. For this purpose, the time signal is mapped into the frequency domain in the encoder. The decoder performs the inverse mapping as described in subclause 2. Depending on the signal, the coder may change the time/frequency resolution by using three different window sizes: 2304, 2048 and 256. To switch between windows, the transition windows LONG_START_WINDOW, LONG_STOP_WINDOW, START_WINDOW_LPD, STOP_WINDOW_1152, STOP_START_WINDOW and STOP_START_WINDOW_1152 are used. Table 5.11 lists the windows, specifies the corresponding transform length and shows the shape of the windows schematically. Three transform lengths are used: 1152, 1024 (or 960) (referred to as long transform) and 128 (or 120) coefficients (referred to as short transform).

Window sequences are composed of windows in a way that a raw_data_block contains data representing 1024 (or 960) output samples. The data element window_sequence indicates the window sequence that is actually used. FIG. 13C lists how the window sequences are composed of individual windows. Refer to subclause 2 for more detailed information about the transform and the windows.

1.2 Scalefactor Bands and Grouping

See ISO/IEC 14496-3, subpart 4, subclause 4.5.2.3.4

As explained in ISO/IEC 14496-3, subpart 4, subclause 4.5.2.3.4, the width of the scalefactor bands is built in imitation of the critical bands of the human auditory system. For that reason, the number of scalefactor bands in a spectrum and their width depend on the transform length and the sampling frequency. Table 4.110 to Table 4.128, in ISO/IEC 14496-3, subpart 4, section 4.5.4, list the offset to the beginning of each scalefactor band for the transform lengths 1024 (960) and 128 (120) and for the sampling frequencies. The tables originally designed for LONG_WINDOW, LONG_START_WINDOW and LONG_STOP_WINDOW are also used for START_WINDOW_LPD and STOP_START_WINDOW. The offset tables for STOP_WINDOW_1152 and STOP_START_WINDOW_1152 are Table 4 to Table 10.

1.3 Decoding of lpd_channel_stream( )

The lpd_channel_stream( ) bitstream element contains all needed information to decode one frame of “linear prediction domain” coded signal. It contains the payload for one frame of encoded signal which was coded in the LPC-domain, i.e. including an LPC filtering step. The residual of this filter (so-called “excitation”) is then represented either with the help of an ACELP module or in the MDCT transform domain (“transform coded excitation”, TCX). To allow close adaptation to the signal characteristics, one frame is broken down into four smaller units of equal size, each of which is coded either with the ACELP or the TCX coding scheme.

This process is similar to the coding scheme described in 3GPP TS 26.290. Inherited from this document is a slightly different terminology, where one “superframe” signifies a signal segment of 1024 samples, whereas a “frame” is exactly one fourth of that, i.e. 256 samples. Each one of these frames is further subdivided into four “subframes” of equal length. Please note that this subchapter adopts this terminology.

1.4 Definitions, Data Elements

-   acelp_core_mode This bitfield indicates the exact bit allocation scheme in case ACELP is used as an lpd coding mode.
-   lpd_mode The bit-field mode defines the coding modes for each of the four frames within one superframe of the lpd_channel_stream( ) (corresponds to one AAC frame). The coding modes are stored in the array mod[ ] and can take values from 0 to 3. The mapping from lpd_mode to mod[ ] can be determined from Table 1 below.

TABLE 1 Mapping of coding modes for lpd_channel_stream( )

               meaning of bits in bit-field mode
lpd_mode       bit 4   bit 3    bit 2    bit 1    bit 0    remaining mod[ ] entries
0 . . . 15     0       mod[3]   mod[2]   mod[1]   mod[0]
16 . . . 19    1       0        0        mod[3]   mod[2]   mod[1] = 2, mod[0] = 2
20 . . . 23    1       0        1        mod[1]   mod[0]   mod[3] = 2, mod[2] = 2
24             1       1        0        0        0        mod[3] = 2, mod[2] = 2, mod[1] = 2, mod[0] = 2
25             1       1        0        0        1        mod[3] = 3, mod[2] = 3, mod[1] = 3, mod[0] = 3
26 . . . 31    reserved

-   mod[0 . . . 3] The values in the array mod[ ] indicate the respective coding modes in each frame:

TABLE 2 Coding modes indicated by mod[ ]

value of mod[x]   coding mode in frame              bitstream element
0                 ACELP                             acelp_coding( )
1                 one frame of TCX                  tcx_coding( )
2                 TCX covering half a superframe    tcx_coding( )
3                 TCX covering entire superframe    tcx_coding( )

-   acelp_coding( ) Syntax element which contains all data to decode one frame of ACELP excitation.
-   tcx_coding( ) Syntax element which contains all data to decode one frame of MDCT based transform coded excitation (TCX).
-   first_tcx_flag Flag which indicates if the current processed TCX frame is the first in the superframe.
-   lpc_data( ) Syntax element which contains all data to decode all LPC filter parameter sets needed to decode the current superframe.
-   first_lpd_flag Flag which indicates whether the current superframe is the first of a sequence of superframes which are coded in LPC domain. This flag can also be determined from the history of the bitstream element core_mode (core_mode0 and core_mode1 in case of a channel_pair_element) according to Table 3.

TABLE 3 Definition of first_lpd_flag

core_mode of previous    core_mode of current
frame (superframe)       frame (superframe)      first_lpd_flag
0                        1                       1
1                        1                       0

-   last_lpd_mode Indicates the lpd_mode of the previously decoded frame.

1.5 Decoding Process

In the lpd_channel_stream the order of decoding is

-   -   Get acelp_core_mode
    -   Get lpd_mode and determine from it the content of the helper variable mod[ ]
    -   Get acelp_coding or tcx_coding data depending on the content of the helper variable mod[ ]
    -   Get lpc_data

1.6 ACELP/TCX Coding Mode Combinations

In analogy to [8], section 5.2.2, there are 26 allowed combinations of ACELP or TCX within one superframe of an lpd_channel_stream payload. One of these 26 mode combinations is signaled in the bitstream element lpd_mode. The mapping of lpd_mode to the actual coding modes of each frame in a superframe is shown in Table 1 and Table 2.
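The mapping of Table 1 from lpd_mode to the helper array mod[ ] can be sketched as follows. The sketch follows the bit assignments of Table 1, with bit 0 taken as the least significant bit of lpd_mode; the function name `decode_lpd_mode` and the choice to raise an error for reserved values are assumptions made for the illustration.

```python
def decode_lpd_mode(lpd_mode):
    """Return [mod[0], mod[1], mod[2], mod[3]] for a 5-bit lpd_mode value."""
    bit0 = lpd_mode & 1
    bit1 = (lpd_mode >> 1) & 1
    if 0 <= lpd_mode <= 15:
        # Bits 3..0 directly carry mod[3]..mod[0] (ACELP or one TCX frame).
        return [(lpd_mode >> i) & 1 for i in range(4)]
    if 16 <= lpd_mode <= 19:
        # First half of the superframe is one TCX block (mode 2).
        return [2, 2, bit0, bit1]
    if 20 <= lpd_mode <= 23:
        # Second half of the superframe is one TCX block (mode 2).
        return [bit0, bit1, 2, 2]
    if lpd_mode == 24:
        return [2, 2, 2, 2]
    if lpd_mode == 25:
        return [3, 3, 3, 3]
    raise ValueError("lpd_mode values 26 to 31 are reserved")


# The non-reserved values yield the 26 allowed combinations.
combinations = {tuple(decode_lpd_mode(v)) for v in range(26)}
print(len(combinations))
```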

TABLE 4 scalefactor bands for a window length of 2304 for STOP_START_1152_WINDOW and STOP_1152_WINDOW at 44.1 and 48 kHz

fs [kHz]              44.1, 48
num_swb_long_window   49

swb   swb_offset_long_window
0     0
1     4
2     8
3     12
4     16
5     20
6     24
7     28
8     32
9     36
10    40
11    48
12    56
13    64
14    72
15    80
16    88
17    96
18    108
19    120
20    132
21    144
22    160
23    176
24    196
25    216
26    240
27    264
28    292
29    320
30    352
31    384
32    416
33    448
34    480
35    512
36    544
37    576
38    608
39    640
40    672
41    704
42    736
43    768
44    800
45    832
46    864
47    896
48    928
      1152

TABLE 5 scalefactor bands for a window length of 2304 for STOP_START_1152_WINDOW and STOP_1152_WINDOW at 32 kHz

fs [kHz]              32
num_swb_long_window   51

swb   swb_offset_long_window
0     0
1     4
2     8
3     12
4     16
5     20
6     24
7     28
8     32
9     36
10    40
11    48
12    56
13    64
14    72
15    80
16    88
17    96
18    108
19    120
20    132
21    144
22    160
23    176
24    196
25    216
26    240
27    264
28    292
29    320
30    352
31    384
32    416
33    448
34    480
35    512
36    544
37    576
38    608
39    640
40    672
41    704
42    736
43    768
44    800
45    832
46    864
47    896
48    928
49    960
50    992
      1152

TABLE 6 scalefactor bands for a window length of 2304 for STOP_START_1152_WINDOW and STOP_1152_WINDOW at 8 kHz

fs [kHz]              8
num_swb_long_window   40

swb   swb_offset_long_window
0     0
1     12
2     24
3     36
4     48
5     60
6     72
7     84
8     96
9     108
10    120
11    132
12    144
13    156
14    172
15    188
16    204
17    220
18    236
19    252
20    268
21    288
22    308
23    328
24    348
25    372
26    396
27    420
28    448
29    476
30    508
31    544
32    580
33    620
34    664
35    712
36    764
37    820
38    880
39    944
      1152

TABLE 7 scalefactor bands for a window length of 2304 for STOP_START_1152_WINDOW and STOP_1152_WINDOW at 11.025, 12 and 16 kHz

fs [kHz]              11.025, 12, 16
num_swb_long_window   43

swb   swb_offset_long_window
0     0
1     8
2     16
3     24
4     32
5     40
6     48
7     56
8     64
9     72
10    80
11    88
12    100
13    112
14    124
15    136
16    148
17    160
18    172
19    184
20    196
21    212
22    228
23    244
24    260
25    280
26    300
27    320
28    344
29    368
30    396
31    424
32    456
33    492
34    532
35    572
36    616
37    664
38    716
39    772
40    832
41    896
42    960
      1152

TABLE 8 scalefactor bands for a window length of 2304 for STOP_START_1152_WINDOW and STOP_1152_WINDOW at 22.05 and 24 kHz

fs [kHz]              22.05 and 24
num_swb_long_window   47

swb   swb_offset_long_window
0     0
1     4
2     8
3     12
4     16
5     20
6     24
7     28
8     32
9     36
10    40
11    44
12    52
13    60
14    68
15    76
16    84
17    92
18    100
19    108
20    116
21    124
22    136
23    148
24    160
25    172
26    188
27    204
28    220
29    240
30    260
31    284
32    308
33    336
34    364
35    396
36    432
37    468
38    508
39    552
40    600
41    652
42    704
43    768
44    832
45    896
46    960
      1152

TABLE 9 scalefactor bands for a window length of 2304 for STOP_START_1152_WINDOW and STOP_1152_WINDOW at 64 kHz

fs [kHz]              64
num_swb_long_window   47 (46)

swb   swb_offset_long_window
0     0
1     4
2     8
3     12
4     16
5     20
6     24
7     28
8     32
9     36
10    40
11    44
12    48
13    52
14    56
15    64
16    72
17    80
18    88
19    100
20    112
21    124
22    140
23    156
24    172
25    192
26    216
27    240
28    268
29    304
30    344
31    384
32    424
33    464
34    504
35    544
36    584
37    624
38    664
39    704
40    744
41    784
42    824
43    864
44    904
45    944
46    984
      1152

TABLE 10 scalefactor bands for a window length of 2304 for STOP_START_1152_WINDOW and STOP_1152_WINDOW at 88.2 and 96 kHz

fs [kHz]: 88.2 and 96; num_swb_long_window: 41

| swb | swb_offset_long_window |
|---|---|
| 0 | 0 |
| 1 | 4 |
| 2 | 8 |
| 3 | 12 |
| 4 | 16 |
| 5 | 20 |
| 6 | 24 |
| 7 | 28 |
| 8 | 32 |
| 9 | 36 |
| 10 | 40 |
| 11 | 44 |
| 12 | 48 |
| 13 | 52 |
| 14 | 56 |
| 15 | 64 |
| 16 | 72 |
| 17 | 80 |
| 18 | 88 |
| 19 | 96 |
| 20 | 108 |
| 21 | 120 |
| 22 | 132 |
| 23 | 144 |
| 24 | 156 |
| 25 | 172 |
| 26 | 188 |
| 27 | 212 |
| 28 | 240 |
| 29 | 276 |
| 30 | 320 |
| 31 | 384 |
| 32 | 448 |
| 33 | 512 |
| 34 | 576 |
| 35 | 640 |
| 36 | 704 |
| 37 | 768 |
| 38 | 832 |
| 39 | 896 |
| 40 | 960 |
| | 1152 |

1.7 Scale Factor Band Tables References

For all other scalefactor band tables please refer to ISO/IEC 14496-3, subpart 4, section 4.5.4, Table 4.129 to Table 4.147.

1.8 Quantization

For quantization of the AAC spectral coefficients in the encoder, a non-uniform quantizer is used. Therefore, the decoder has to perform the inverse non-uniform quantization after the Huffman decoding of the scalefactors (see subclause 6.3) and the noiseless decoding of the spectral data (see subclause 6.1).

For the quantization of the TCX spectral coefficients, a uniform quantizer is used. No inverse quantization is needed at the decoder after the noiseless decoding of the spectral data.

2. Filterbank and Block Switching

2.1 Tool Description

The time/frequency representation of the signal is mapped onto the time domain by feeding it into the filterbank module. This module consists of an inverse modified discrete cosine transform (IMDCT), and a window and an overlap-add function. In order to adapt the time/frequency resolution of the filterbank to the characteristics of the input signal, a block switching tool is also adopted. N represents the window length, where N is a function of the window_sequence (see subclause 1.1). For each channel, the N/2 time-frequency values X_(i,k) are transformed into the N time domain values x_(i,n) via the IMDCT. After applying the window function, for each channel, the first half of the z_(i,n) sequence is added to the second half of the previous block windowed sequence to reconstruct the output samples for each channel out_(i,n).

2.2 Definitions

-   window_sequence 2 bit indicating which window sequence (i.e. block size) is used.
-   window_shape 1 bit indicating which window function is selected.

FIG. 13C shows the eight window_sequences (ONLY_LONG_SEQUENCE, LONG_START_SEQUENCE, EIGHT_SHORT_SEQUENCE, LONG_STOP_SEQUENCE, STOP_START_SEQUENCE, STOP_1152_SEQUENCE, LPD_START_SEQUENCE, STOP_START_1152_SEQUENCE).

In the following, LPD_SEQUENCE refers to all allowed window/coding mode combinations inside the so-called linear prediction domain codec (see section 1.3). In the context of decoding a frequency domain coded frame, it is important to know only whether a following frame is encoded with the LP domain coding modes, which is represented by an LPD_SEQUENCE. The exact structure within the LPD_SEQUENCE is taken care of when decoding the LP domain coded frame.

2.3 Decoding Process

2.3.1 IMDCT

The analytical expression of the IMDCT is:

$x_{i,n} = \frac{2}{N}\sum_{k=0}^{\frac{N}{2}-1} \mathrm{spec}[i][k] \cos\left(\frac{2\pi}{N}\left(n + n_0\right)\left(k + \frac{1}{2}\right)\right) \quad \text{for } 0 \leq n < N$

where:

-   n = sample index
-   i = window index
-   k = spectral coefficient index
-   N = window length based on the window_sequence value
-   n₀ = (N/2+1)/2

The synthesis window length N for the inverse transform is a function of the syntax element window_sequence and the algorithmic context. It is defined as follows:

Window length 2304:

$N = \begin{cases} 2304, & \text{if STOP\_1152\_SEQUENCE} \\ 2304, & \text{if STOP\_START\_1152\_SEQUENCE} \end{cases}$

Window length 2048:

$N = \begin{cases} 2048, & \text{if ONLY\_LONG\_SEQUENCE} \\ 2048, & \text{if LONG\_START\_SEQUENCE} \\ 256, & \text{if EIGHT\_SHORT\_SEQUENCE} \\ 2048, & \text{if LONG\_STOP\_SEQUENCE} \\ 2048, & \text{if STOP\_START\_SEQUENCE} \\ 2048, & \text{if LPD\_START\_SEQUENCE} \end{cases}$
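The two case distinctions above amount to a simple lookup. A minimal sketch follows, assuming an enumeration whose names mirror the window sequences in the text; the numeric enumerator values and the function name `imdct_window_length` are illustrative and are not the bitstream coding.

```c
#include <assert.h>

/* Names follow the window sequences listed in the text; the numeric
 * values are arbitrary and do NOT represent the bitstream coding. */
typedef enum {
    ONLY_LONG_SEQUENCE,
    LONG_START_SEQUENCE,
    EIGHT_SHORT_SEQUENCE,
    LONG_STOP_SEQUENCE,
    STOP_START_SEQUENCE,
    STOP_1152_SEQUENCE,
    LPD_START_SEQUENCE,
    STOP_START_1152_SEQUENCE
} window_sequence_t;

/* Returns the synthesis window length N used by the inverse transform. */
int imdct_window_length(window_sequence_t seq)
{
    switch (seq) {
    case STOP_1152_SEQUENCE:
    case STOP_START_1152_SEQUENCE:
        return 2304;
    case EIGHT_SHORT_SEQUENCE:
        return 256; /* length of each short block */
    case ONLY_LONG_SEQUENCE:
    case LONG_START_SEQUENCE:
    case LONG_STOP_SEQUENCE:
    case STOP_START_SEQUENCE:
    case LPD_START_SEQUENCE:
        return 2048;
    }
    assert(!"unknown window sequence");
    return 0;
}
```

For the two 1152 sequences the transform therefore carries N/2 = 1152 spectral coefficients, which matches the final offset of 1152 in the scalefactor band tables for window length 2304.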

The meaningful block transitions are as follows:

-   from ONLY_LONG_SEQUENCE to: ONLY_LONG_SEQUENCE, LONG_START_SEQUENCE, LPD_START_SEQUENCE
-   from LONG_START_SEQUENCE to: EIGHT_SHORT_SEQUENCE, LONG_STOP_SEQUENCE
-   from LONG_STOP_SEQUENCE to: ONLY_LONG_SEQUENCE, LONG_START_SEQUENCE, LPD_START_SEQUENCE
-   from EIGHT_SHORT_SEQUENCE to: EIGHT_SHORT_SEQUENCE, LONG_STOP_SEQUENCE, STOP_START_SEQUENCE
-   from LPD_SEQUENCE to: LPD_SEQUENCE, STOP_1152_SEQUENCE, STOP_START_1152_SEQUENCE
-   from STOP_START_SEQUENCE to: EIGHT_SHORT_SEQUENCE, LONG_STOP_SEQUENCE
-   from LPD_START_SEQUENCE to: LPD_SEQUENCE
-   from STOP_1152_SEQUENCE to: ONLY_LONG_SEQUENCE, LONG_START_SEQUENCE
-   from STOP_START_1152_SEQUENCE to: EIGHT_SHORT_SEQUENCE, LONG_STOP_SEQUENCE

2.3.2 Windowing and Block Switching

Depending on the window_sequence and window_shape element, different transform windows are used. A combination of the window halves described as follows offers all possible window_sequences.

For window_shape=1, the window coefficients are given by the Kaiser-Bessel derived (KBD) window as follows:

$W_{KBD\_LEFT,N}(n) = \sqrt{\frac{\sum_{p=0}^{n}\left[W'(p,\alpha)\right]}{\sum_{p=0}^{N/2}\left[W'(p,\alpha)\right]}} \quad \text{for } 0 \leq n < \frac{N}{2}$

$W_{KBD\_RIGHT,N}(n) = \sqrt{\frac{\sum_{p=0}^{N-n-1}\left[W'(p,\alpha)\right]}{\sum_{p=0}^{N/2}\left[W'(p,\alpha)\right]}} \quad \text{for } \frac{N}{2} \leq n < N$

where:

W′, the Kaiser-Bessel kernel window function (see also [5]), is defined as follows:

$W'(n,\alpha) = \frac{I_0\left[\pi\alpha\sqrt{1.0 - \left(\frac{n - N/4}{N/4}\right)^2}\right]}{I_0\left[\pi\alpha\right]} \quad \text{for } 0 \leq n \leq \frac{N}{2}$

$I_0[x] = \sum_{k=0}^{\infty}\left[\frac{\left(\frac{x}{2}\right)^k}{k!}\right]^2$

$\alpha = \text{kernel window alpha factor}, \quad \alpha = \begin{cases} 4 & \text{for } N = 2048\ (1920) \\ 6 & \text{for } N = 256\ (240) \end{cases}$

Otherwise, for window_shape=0, a sine window is employed as follows:

$W_{SIN\_LEFT,N}(n) = \sin\left(\frac{\pi}{N}\left(n + \frac{1}{2}\right)\right) \quad \text{for } 0 \leq n < \frac{N}{2}$

$W_{SIN\_RIGHT,N}(n) = \sin\left(\frac{\pi}{N}\left(n + \frac{1}{2}\right)\right) \quad \text{for } \frac{N}{2} \leq n < N$

The window length N can be 2048 (1920) or 256 (240) for the KBD and the sine window. In case of STOP_1152_SEQUENCE and STOP_START_1152_SEQUENCE, N can still be 2048 or 256; the window slopes are similar but the flat-top regions are longer.

Only in the case of LPD_START_SEQUENCE is the right part of the window a sine window of 64 samples.

How to obtain the possible window sequences is explained in parts a)-h) of this subclause.

For all kinds of window_sequences, the window_shape of the left half of the first transform window is determined by the window shape of the previous block. The following formula expresses this fact:

$W_{LEFT,N}(n) = \begin{cases} W_{KBD\_LEFT,N}(n), & \text{if window\_shape\_previous\_block} == 1 \\ W_{SIN\_LEFT,N}(n), & \text{if window\_shape\_previous\_block} == 0 \end{cases}$

where:

window_shape_previous_block: window_shape of the previous block (i−1).

For the first raw_data_block( ) to be decoded, the window_shape of the left and right half of the window are identical.

a) ONLY_LONG_SEQUENCE:

The window_sequence=ONLY_LONG_SEQUENCE is equal to one LONG_WINDOW with a total window length N_l of 2048 (1920).

For window_shape=1 the window for ONLY_LONG_SEQUENCE is given as follows:

$W(n) = \begin{cases} W_{LEFT,N_l}(n), & \text{for } 0 \leq n < N_l/2 \\ W_{KBD\_RIGHT,N_l}(n), & \text{for } N_l/2 \leq n < N_l \end{cases}$

If window_shape=0 the window for ONLY_LONG_SEQUENCE can be described as follows:

$W(n) = \begin{cases} W_{LEFT,N_l}(n), & \text{for } 0 \leq n < N_l/2 \\ W_{SIN\_RIGHT,N_l}(n), & \text{for } N_l/2 \leq n < N_l \end{cases}$

After windowing, the time domain values (z_(i,n)) can be expressed as:

$z_{i,n} = W(n) \cdot x_{i,n}$

b) LONG_START_SEQUENCE:

The LONG_START_SEQUENCE is needed to obtain a correct overlap and add for a block transition from an ONLY_LONG_SEQUENCE to an EIGHT_SHORT_SEQUENCE.

Window lengths N_l and N_s are set to 2048 (1920) and 256 (240) respectively.

If window_shape=1 the window for LONG_START_SEQUENCE is given as follows:

$W(n) = \begin{cases} W_{LEFT,N_l}(n), & \text{for } 0 \leq n < N_l/2 \\ 1.0, & \text{for } N_l/2 \leq n < \frac{3N_l - N_s}{4} \\ W_{KBD\_RIGHT,N_s}\left(n + \frac{N_s}{2} - \frac{3N_l - N_s}{4}\right), & \text{for } \frac{3N_l - N_s}{4} \leq n < \frac{3N_l + N_s}{4} \\ 0.0, & \text{for } \frac{3N_l + N_s}{4} \leq n < N_l \end{cases}$

If window_shape=0 the window for LONG_START_SEQUENCE looks like:

$W(n) = \begin{cases} W_{LEFT,N_l}(n), & \text{for } 0 \leq n < N_l/2 \\ 1.0, & \text{for } N_l/2 \leq n < \frac{3N_l - N_s}{4} \\ W_{SIN\_RIGHT,N_s}\left(n + \frac{N_s}{2} - \frac{3N_l - N_s}{4}\right), & \text{for } \frac{3N_l - N_s}{4} \leq n < \frac{3N_l + N_s}{4} \\ 0.0, & \text{for } \frac{3N_l + N_s}{4} \leq n < N_l \end{cases}$

The windowed time-domain values can be calculated with the formula explained in a).

c) EIGHT_SHORT

The window_sequence=EIGHT_SHORT comprises eight overlapped and added SHORT_WINDOWs with a length N_s of 256 (240) each. The total length of the window_sequence together with leading and following zeros is 2048 (1920). Each of the eight short blocks is windowed separately first. The short block number is indexed with the variable j=0, . . . , M−1 (M=N_l/N_s).

The window_shape of the previous block influences the first of the eight short blocks (W₀(n)) only. If window_shape=1 the window functions can be given as follows:

$W_0(n) = \begin{cases} W_{LEFT,N_s}(n), & \text{for } 0 \leq n < N_s/2 \\ W_{KBD\_RIGHT,N_s}(n), & \text{for } N_s/2 \leq n < N_s \end{cases}$

$W_{1 \ldots (M-1)}(n) = \begin{cases} W_{KBD\_LEFT,N_s}(n), & \text{for } 0 \leq n < N_s/2 \\ W_{KBD\_RIGHT,N_s}(n), & \text{for } N_s/2 \leq n < N_s \end{cases}$

Otherwise, if window_shape=0, the window functions can be described as:

$W_0(n) = \begin{cases} W_{LEFT,N_s}(n), & \text{for } 0 \leq n < N_s/2 \\ W_{SIN\_RIGHT,N_s}(n), & \text{for } N_s/2 \leq n < N_s \end{cases}$

$W_{1 \ldots (M-1)}(n) = \begin{cases} W_{SIN\_LEFT,N_s}(n), & \text{for } 0 \leq n < N_s/2 \\ W_{SIN\_RIGHT,N_s}(n), & \text{for } N_s/2 \leq n < N_s \end{cases}$

The overlap and add within the EIGHT_SHORT window_sequence, resulting in the windowed time domain values, is described as follows:

$z_{i,n} = \begin{cases} 0, & \text{for } 0 \leq n < \frac{N_l - N_s}{4} \\ x_{0,\,n - \frac{N_l - N_s}{4}} \cdot W_0\left(n - \frac{N_l - N_s}{4}\right), & \text{for } \frac{N_l - N_s}{4} \leq n < \frac{N_l + N_s}{4} \\ x_{j-1,\,n - \frac{N_l + (2j-3)N_s}{4}} \cdot W_{j-1}\left(n - \frac{N_l + (2j-3)N_s}{4}\right) + x_{j,\,n - \frac{N_l + (2j-1)N_s}{4}} \cdot W_j\left(n - \frac{N_l + (2j-1)N_s}{4}\right), & \text{for } 1 \leq j < M,\ \frac{N_l + (2j-1)N_s}{4} \leq n < \frac{N_l + (2j+1)N_s}{4} \\ x_{M-1,\,n - \frac{N_l + (2M-3)N_s}{4}} \cdot W_{M-1}\left(n - \frac{N_l + (2M-3)N_s}{4}\right), & \text{for } \frac{N_l + (2M-1)N_s}{4} \leq n < \frac{N_l + (2M+1)N_s}{4} \\ 0, & \text{for } \frac{N_l + (2M+1)N_s}{4} \leq n < N_l \end{cases}$
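The piecewise expression above is equivalent to accumulating each windowed short block at a hop of N_s/2, starting at (N_l − N_s)/4. A minimal sketch under that reading follows; the flat array layout (block j stored at index j·N_s) and the function name are assumptions of this example.

```c
#include <assert.h>
#include <stddef.h>

/* x: m_blocks IMDCT outputs of n_s samples each (block j at x[j * n_s]),
 * w: the matching windows W_j in the same layout,
 * z: n_l windowed and internally overlap-added samples. */
void eight_short_overlap_add(const double *x, const double *w, double *z,
                             size_t n_l, size_t n_s, size_t m_blocks)
{
    size_t start = (n_l - n_s) / 4; /* leading zeros */

    assert(m_blocks > 0 && n_s % 2 == 0 && n_l >= n_s);
    assert(start + (m_blocks - 1) * (n_s / 2) + n_s <= n_l);

    for (size_t n = 0; n < n_l; n++) {
        z[n] = 0.0;
    }
    for (size_t j = 0; j < m_blocks; j++) {
        /* (N_l + (2j - 1) N_s) / 4 == start + j * N_s / 2 */
        size_t offset = start + j * (n_s / 2);
        for (size_t m = 0; m < n_s; m++) {
            z[offset + m] += x[j * n_s + m] * w[j * n_s + m];
        }
    }
}
```

With N_l = 2048, N_s = 256 and eight blocks, the non-zero region spans samples 448 to 1599, and every sample between 576 and 1471 receives contributions from two neighbouring short blocks.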

d) LONG_STOP_SEQUENCE

This window_sequence is needed to switch from an EIGHT_SHORT_SEQUENCE back to an ONLY_LONG_SEQUENCE.

If window_shape=1 the window for LONG_STOP_SEQUENCE is given as follows:

$W(n) = \begin{cases} 0.0, & \text{for } 0 \leq n < \frac{N_l - N_s}{4} \\ W_{LEFT,N_s}\left(n - \frac{N_l - N_s}{4}\right), & \text{for } \frac{N_l - N_s}{4} \leq n < \frac{N_l + N_s}{4} \\ 1.0, & \text{for } \frac{N_l + N_s}{4} \leq n < N_l/2 \\ W_{KBD\_RIGHT,N_l}(n), & \text{for } N_l/2 \leq n < N_l \end{cases}$

If window_shape=0 the window for LONG_STOP_SEQUENCE is determined by:

$W(n) = \begin{cases} 0.0, & \text{for } 0 \leq n < \frac{N_l - N_s}{4} \\ W_{LEFT,N_s}\left(n - \frac{N_l - N_s}{4}\right), & \text{for } \frac{N_l - N_s}{4} \leq n < \frac{N_l + N_s}{4} \\ 1.0, & \text{for } \frac{N_l + N_s}{4} \leq n < N_l/2 \\ W_{SIN\_RIGHT,N_l}(n), & \text{for } N_l/2 \leq n < N_l \end{cases}$

The windowed time domain values can be calculated with the formula explained in a).

e) STOP_START_SEQUENCE:

The STOP_START_SEQUENCE is needed to obtain a correct overlap and add for a block transition from an EIGHT_SHORT_SEQUENCE to an EIGHT_SHORT_SEQUENCE when just an ONLY_LONG_SEQUENCE is needed.

Window lengths N_l and N_s are set to 2048 (1920) and 256 (240) respectively.

If window_shape=1 the window for STOP_START_SEQUENCE is given as follows:

$W(n) = \begin{cases} 0.0, & \text{for } 0 \leq n < \frac{N_l - N_s}{4} \\ W_{LEFT,N_s}\left(n - \frac{N_l - N_s}{4}\right), & \text{for } \frac{N_l - N_s}{4} \leq n < \frac{N_l + N_s}{4} \\ 1.0, & \text{for } \frac{N_l + N_s}{4} \leq n < \frac{3N_l - N_s}{4} \\ W_{KBD\_RIGHT,N_s}\left(n + \frac{N_s}{2} - \frac{3N_l - N_s}{4}\right), & \text{for } \frac{3N_l - N_s}{4} \leq n < \frac{3N_l + N_s}{4} \\ 0.0, & \text{for } \frac{3N_l + N_s}{4} \leq n < N_l \end{cases}$

If window_shape=0 the window for STOP_START_SEQUENCE looks like:

$W(n) = \begin{cases} 0.0, & \text{for } 0 \leq n < \frac{N_l - N_s}{4} \\ W_{LEFT,N_s}\left(n - \frac{N_l - N_s}{4}\right), & \text{for } \frac{N_l - N_s}{4} \leq n < \frac{N_l + N_s}{4} \\ 1.0, & \text{for } \frac{N_l + N_s}{4} \leq n < \frac{3N_l - N_s}{4} \\ W_{SIN\_RIGHT,N_s}\left(n + \frac{N_s}{2} - \frac{3N_l - N_s}{4}\right), & \text{for } \frac{3N_l - N_s}{4} \leq n < \frac{3N_l + N_s}{4} \\ 0.0, & \text{for } \frac{3N_l + N_s}{4} \leq n < N_l \end{cases}$

The windowed time-domain values can be calculated with the formula explained in a).

f) LPD_START_SEQUENCE:

The LPD_START_SEQUENCE is needed to obtain a correct overlap and add for a block transition from an ONLY_LONG_SEQUENCE to an LPD_SEQUENCE.

Window lengths N_l and N_s are set to 2048 (1920) and 256 (240) respectively.

If window_shape=1 the window for LPD_START_SEQUENCE is given as follows:

$W(n) = \begin{cases} W_{LEFT,N_l}(n), & \text{for } 0 \leq n < \frac{N_l}{2} \\ 1.0, & \text{for } \frac{N_l}{2} \leq n < \frac{3N_l - N_s}{4} \\ W_{KBD\_RIGHT,\frac{N_s}{2}}\left(n + \frac{N_s}{4} - \frac{3N_l - N_s}{4}\right), & \text{for } \frac{3N_l - N_s}{4} \leq n < \frac{3N_l}{4} \\ 0.0, & \text{for } \frac{3N_l}{4} \leq n < N_l \end{cases}$

If window_shape=0 the window for LPD_START_SEQUENCE looks like:

$W(n) = \begin{cases} W_{LEFT,N_l}(n), & \text{for } 0 \leq n < \frac{N_l}{2} \\ 1.0, & \text{for } \frac{N_l}{2} \leq n < \frac{3N_l - N_s}{4} \\ W_{SIN\_RIGHT,\frac{N_s}{2}}\left(n + \frac{N_s}{4} - \frac{3N_l - N_s}{4}\right), & \text{for } \frac{3N_l - N_s}{4} \leq n < \frac{3N_l}{4} \\ 0.0, & \text{for } \frac{3N_l}{4} \leq n < N_l \end{cases}$

The windowed time-domain values can be calculated with the formula explained in a).

g) STOP_1152_SEQUENCE:

The STOP_1152_SEQUENCE is needed to obtain a correct overlap and add for a block transition from an LPD_SEQUENCE to an ONLY_LONG_SEQUENCE.

Window lengths N_l and N_s are set to 2048 (1920) and 256 (240) respectively.

If window_shape=1 the window for STOP_1152_SEQUENCE is given as follows:

$W(n) = \begin{cases} 0.0, & \text{for } 0 \leq n < \frac{N_l}{4} \\ W_{LEFT,N_s}\left(n - \frac{N_l}{4}\right), & \text{for } \frac{N_l}{4} \leq n < \frac{N_l + 2N_s}{4} \\ 1.0, & \text{for } \frac{N_l + 2N_s}{4} \leq n < \frac{2N_l + 3N_s}{4} \\ W_{KBD\_RIGHT,N_l}\left(n + \frac{N_l}{2} - \frac{2N_l + 3N_s}{4}\right), & \text{for } \frac{2N_l + 3N_s}{4} \leq n < N_l + \frac{3N_s}{4} \\ 0.0, & \text{for } N_l + \frac{3N_s}{4} \leq n < N_l + N_s \end{cases}$

If window_shape=0 the window for STOP_1152_SEQUENCE looks like:

$W(n) = \begin{cases} 0.0, & \text{for } 0 \leq n < \frac{N_l}{4} \\ W_{LEFT,N_s}\left(n - \frac{N_l}{4}\right), & \text{for } \frac{N_l}{4} \leq n < \frac{N_l + 2N_s}{4} \\ 1.0, & \text{for } \frac{N_l + 2N_s}{4} \leq n < \frac{2N_l + 3N_s}{4} \\ W_{SIN\_RIGHT,N_l}\left(n + \frac{N_l}{2} - \frac{2N_l + 3N_s}{4}\right), & \text{for } \frac{2N_l + 3N_s}{4} \leq n < N_l + \frac{3N_s}{4} \\ 0.0, & \text{for } N_l + \frac{3N_s}{4} \leq n < N_l + N_s \end{cases}$

The windowed time-domain values can be calculated with the formula explained in a).

h) STOP_START_1152_SEQUENCE:

The STOP_START_1152_SEQUENCE is needed to obtain a correct overlap and add for a block transition from an LPD_SEQUENCE to an EIGHT_SHORT_SEQUENCE when just an ONLY_LONG_SEQUENCE is needed.

Window lengths N_l and N_s are set to 2048 (1920) and 256 (240) respectively.

If window_shape=1 the window for STOP_START_1152_SEQUENCE is given as follows:

$W(n) = \begin{cases} 0.0, & \text{for } 0 \leq n < \frac{N_l}{4} \\ W_{LEFT,N_s}\left(n - \frac{N_l}{4}\right), & \text{for } \frac{N_l}{4} \leq n < \frac{N_l + 2N_s}{4} \\ 1.0, & \text{for } \frac{N_l + 2N_s}{4} \leq n < \frac{3N_l}{4} + \frac{N_s}{2} \\ W_{KBD\_RIGHT,N_s}\left(n + \frac{N_s}{2} - \left(\frac{3N_l}{4} + \frac{N_s}{2}\right)\right), & \text{for } \frac{3N_l}{4} + \frac{N_s}{2} \leq n < \frac{3N_l}{4} + N_s \\ 0.0, & \text{for } \frac{3N_l}{4} + N_s \leq n < N_l + N_s \end{cases}$

If window_shape=0 the window for STOP_START_1152_SEQUENCE looks like:

$W(n) = \begin{cases} 0.0, & \text{for } 0 \leq n < \frac{N_l}{4} \\ W_{LEFT,N_s}\left(n - \frac{N_l}{4}\right), & \text{for } \frac{N_l}{4} \leq n < \frac{N_l + 2N_s}{4} \\ 1.0, & \text{for } \frac{N_l + 2N_s}{4} \leq n < \frac{3N_l}{4} + \frac{N_s}{2} \\ W_{SIN\_RIGHT,N_s}\left(n + \frac{N_s}{2} - \left(\frac{3N_l}{4} + \frac{N_s}{2}\right)\right), & \text{for } \frac{3N_l}{4} + \frac{N_s}{2} \leq n < \frac{3N_l}{4} + N_s \\ 0.0, & \text{for } \frac{3N_l}{4} + N_s \leq n < N_l + N_s \end{cases}$

The windowed time-domain values can be calculated with the formula explained in a).

2.3.3 Overlapping and Adding with Previous Window Sequence

Besides the overlap and add within the EIGHT_SHORT window_sequence, the first (left) part of every window_sequence is overlapped and added with the second (right) part of the previous window_sequence, resulting in the final time domain values out_(i,n). The mathematical expression for this operation can be described as follows.

In case of ONLY_LONG_SEQUENCE, LONG_START_SEQUENCE, EIGHT_SHORT_SEQUENCE, LONG_STOP_SEQUENCE, STOP_START_SEQUENCE, LPD_START_SEQUENCE:

$out_{i,n} = z_{i,n} + z_{i-1,\,n + \frac{N}{2}}; \quad \text{for } 0 \leq n < \frac{N}{2},\ N = 2048\ (1920)$

And in case of STOP_1152_SEQUENCE, STOP_START_1152_SEQUENCE:

$out_{i,n} = z_{i,n} + z_{i-1,\,n + \frac{N_l}{2} + \frac{3N_s}{4}}; \quad \text{for } 0 \leq n < \frac{N_l}{2},\ N_l = 2048,\ N_s = 256$
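Both overlap-add cases differ only in the index offset into the previous windowed sequence. A minimal sketch, with hypothetical function names and a boolean flag standing in for the window_sequence test:

```c
#include <assert.h>
#include <stddef.h>

/* Offset into the previous windowed sequence z_(i-1):
 * N/2 in the regular case, N_l/2 + 3*N_s/4 for the 1152 sequences. */
size_t overlap_offset(int is_1152_sequence, size_t n_l, size_t n_s)
{
    return is_1152_sequence ? n_l / 2 + 3 * n_s / 4 : n_l / 2;
}

/* out[n] = z_cur[n] + z_prev[n + offset] for 0 <= n < count. */
void overlap_add_previous(const double *z_cur, const double *z_prev,
                          double *out, size_t count, size_t offset)
{
    assert(z_cur != NULL && z_prev != NULL && out != NULL);
    for (size_t n = 0; n < count; n++) {
        out[n] = z_cur[n] + z_prev[n + offset];
    }
}
```

For N_l = 2048 and N_s = 256 the offsets are 1024 and 1216 respectively, and `count` is 1024 in both cases.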

In case of LPD_START_SEQUENCE, the next sequence is an LPD_SEQUENCE. A SIN or KBD window is applied on the left part of the LPD_SEQUENCE to have a good overlap and add.

$W_{SIN\_LEFT,N}(n) = \sin\left(\frac{\pi}{N}\left(n + \frac{1}{2}\right)\right) \quad \text{for } 0 \leq n < \frac{N}{2}, \text{ with } N = 128$

In case of STOP_1152_SEQUENCE, STOP_START_1152_SEQUENCE the previous sequence is an LPD_SEQUENCE. A TDAC window is applied on the right part of the LPD_SEQUENCE to have a good overlap and add.

3.1 Windowing and Block Switching

Depending on the window_shape element, different oversampled transform window prototypes are used. The length of the oversampled windows is

$N_{os} = 2 \cdot n\_long \cdot os\_factor\_win$

For window_shape=1, the window coefficients are given by the Kaiser-Bessel derived (KBD) window as follows:

$W_{KBD}\left(n - \frac{N_{os}}{2}\right) = \sqrt{\frac{\sum_{p=0}^{N_{os}-n-1}\left[W'(p,\alpha)\right]}{\sum_{p=0}^{N_{os}/2}\left[W'(p,\alpha)\right]}} \quad \text{for } \frac{N_{os}}{2} \leq n < N_{os}$

where W′, the Kaiser-Bessel kernel window function (see also [5]), is defined as follows:

$W'(n,\alpha) = \frac{I_0\left[\pi\alpha\sqrt{1.0 - \left(\frac{n - N_{os}/4}{N_{os}/4}\right)^2}\right]}{I_0\left[\pi\alpha\right]} \quad \text{for } 0 \leq n \leq \frac{N_{os}}{2}$

$I_0[x] = \sum_{k=0}^{\infty}\left[\frac{\left(\frac{x}{2}\right)^k}{k!}\right]^2$

α=kernel window alpha factor, α=4

Otherwise, for window_shape=0, a sine window is employed as follows:

$W_{SIN}\left(n - \frac{N_{os}}{2}\right) = \sin\left(\frac{\pi}{N_{os}}\left(n + \frac{1}{2}\right)\right) \quad \text{for } \frac{N_{os}}{2} \leq n < N_{os}$

For all kinds of window_sequences, the prototype used for the left window part is determined by the window shape of the previous block. The following formula expresses this fact:

$left\_window\_shape[n] = \begin{cases} W_{KBD}[n], & \text{if window\_shape\_previous\_block} == 1 \\ W_{SIN}[n], & \text{if window\_shape\_previous\_block} == 0 \end{cases}$

Likewise, the prototype for the right window shape is determined by the following formula:

$right\_window\_shape[n] = \begin{cases} W_{KBD}[n], & \text{if window\_shape} == 1 \\ W_{SIN}[n], & \text{if window\_shape} == 0 \end{cases}$

Since the transition lengths are already determined, it is only necessary to differentiate between EIGHT_SHORT_SEQUENCEs and all others:

a) EIGHT_SHORT_SEQUENCE:

The following C-like code portion describes the windowing and internal overlap-add of an EIGHT_SHORT_SEQUENCE:

    tw_windowing_short(X[][], z[], first_pos, last_pos,
                       warped_trans_len_left, warped_trans_len_right,
                       left_window_shape[], right_window_shape[]) {

      offset = n_long - 4*n_short - n_short/2;

      /* left slope of the first short block, warped transition length */
      tr_scale_l = 0.5*n_long/warped_trans_len_left*os_factor_win;
      tr_pos_l = (warped_trans_len_left + (first_pos - n_long/2) + 0.5)*tr_scale_l;
      tr_scale_r = 8*os_factor_win;
      tr_pos_r = tr_scale_r/2;

      for (i = 0; i < n_short; i++) {
        z[i] = X[0][i];
      }
      for (i = 0; i < first_pos; i++)
        z[i] = 0.;
      for (i = n_long-1-first_pos; i >= first_pos; i--) {
        z[i] *= left_window_shape[floor(tr_pos_l)];
        tr_pos_l += tr_scale_l;
      }
      for (i = 0; i < n_short; i++) {
        z[offset+i+n_short] =
          X[0][i+n_short]*right_window_shape[floor(tr_pos_r)];
        tr_pos_r += tr_scale_r;
      }
      offset += n_short;

      /* short blocks 1..6: internal overlap-add */
      for (k = 1; k < 7; k++) {
        tr_scale_l = n_short*os_factor_win;
        tr_pos_l = tr_scale_l/2;
        tr_pos_r = os_factor_win*n_long - tr_pos_l;
        for (i = 0; i < n_short; i++) {
          z[i + offset] += X[k][i]*right_window_shape[floor(tr_pos_r)];
          z[offset + n_short + i] =
            X[k][n_short + i]*right_window_shape[floor(tr_pos_l)];
          tr_pos_l += tr_scale_l;
          tr_pos_r -= tr_scale_l;
        }
        offset += n_short;
      }

      /* last short block */
      tr_scale_l = n_short*os_factor_win;
      tr_pos_l = tr_scale_l/2;
      for (i = n_short - 1; i >= 0; i--) {
        z[i + offset] += X[7][i]*right_window_shape[(int)floor(tr_pos_l)];
        tr_pos_l += tr_scale_l;
      }
      for (i = 0; i < n_short; i++) {
        z[offset + n_short + i] = X[7][n_short + i];
      }

      /* right slope of the last short block, warped transition length */
      tr_scale_r = 0.5*n_long/warped_trans_len_right*os_factor_win;
      tr_pos_r = 0.5*tr_scale_r + .5;  /* overwritten by the next assignment */
      tr_pos_r = (1.5*n_long - (float)last_pos - 0.5 + warped_trans_len_right)*tr_scale_r;
      for (i = 3*n_long-1-last_pos; i <= last_pos; i++) {
        z[i] *= right_window_shape[floor(tr_pos_r)];
        tr_pos_r += tr_scale_r;
      }
      for (i = last_pos+1; i < 2*n_long; i++)
        z[i] = 0.;
    }

b) all others:

    tw_windowing_long(X[][], z[], first_pos, last_pos,
                      warped_trans_len_left, warped_trans_len_right,
                      left_window_shape[], right_window_shape[]) {

      for (i = 0; i < first_pos; i++)
        z[i] = 0.;
      for (i = last_pos+1; i < N; i++)
        z[i] = 0.;

      /* left slope */
      tr_scale = 0.5*n_long/warped_trans_len_left*os_factor_win;
      tr_pos = (warped_trans_len_left + first_pos - N/4 + 0.5)*tr_scale;
      for (i = N/2-1-first_pos; i >= first_pos; i--) {
        z[i] = X[0][i]*left_window_shape[floor(tr_pos)];
        tr_pos += tr_scale;
      }

      /* right slope */
      tr_scale = 0.5*n_long/warped_trans_len_right*os_factor_win;
      tr_pos = (3*N/4 - last_pos - 0.5 + warped_trans_len_right)*tr_scale;
      for (i = 3*N/2-1-last_pos; i <= last_pos; i++) {
        z[i] = X[0][i]*right_window_shape[floor(tr_pos)];
        tr_pos += tr_scale;
      }
    }

4. MDCT Based TCX

4.1 Tool Description

When the core_mode is equal to 1 and when one or more of the three TCX modes is selected as the “linear prediction-domain” coding, i.e. one of the 4 array entries of mod[ ] is greater than 0, the MDCT based TCX tool is used. The MDCT based TCX receives the quantized spectral coefficients from the arithmetic decoder. The quantized coefficients are first completed by a comfort noise before applying an inverse MDCT transformation to get a time-domain weighted synthesis, which is then fed to the weighting synthesis LPC-filter.

4.2 Definitions

-   lg Number of quantized spectral coefficients output by the arithmetic decoder
-   noise_factor Noise level quantization index
-   noise_level Level of noise injected in reconstructed spectrum
-   noise[ ] Vector of generated noise
-   global_gain Re-scaling gain quantization index
-   g Re-scaling gain
-   rms Root mean square of the synthesized time-domain signal, x[ ]
-   x[ ] Synthesized time-domain signal

4.3 Decoding Process

The MDCT-based TCX requests from the arithmetic decoder a number of quantized spectral coefficients, lg, which is determined by the mod[ ] and last_lpd_mode values. These two values also define the window length and shape which will be applied in the inverse MDCT. The window is composed of three parts: a left side overlap of L samples, a middle part of ones of M samples and a right overlap part of R samples. To obtain an MDCT window of length 2*lg, ZL zeros are added on the left and ZR zeros on the right side as indicated in FIG. 14G for Table 3/FIG. 14F.

TABLE 3 Number of Spectral Coefficients as a Function of last_lpd_mode and mod[ ]

| Value of last_lpd_mode | Value of mod[x] | Number lg of spectral coefficients | ZL | L | M | R | ZR |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 320 | 160 | 0 | 256 | 128 | 96 |
| 0 | 2 | 576 | 288 | 0 | 512 | 128 | 224 |
| 0 | 3 | 1152 | 512 | 128 | 1024 | 128 | 512 |
| 1..3 | 1 | 256 | 64 | 128 | 128 | 128 | 64 |
| 1..3 | 2 | 512 | 192 | 128 | 384 | 128 | 192 |
| 1..3 | 3 | 1024 | 448 | 128 | 896 | 128 | 448 |

The MDCT window is given by

$W(n) = \begin{cases} 0, & \text{for } 0 \leq n < ZL \\ W_{SIN\_LEFT,L}(n - ZL), & \text{for } ZL \leq n < ZL + L \\ 1, & \text{for } ZL + L \leq n < ZL + L + M \\ W_{SIN\_RIGHT,R}(n - ZL - L - M), & \text{for } ZL + L + M \leq n < ZL + L + M + R \\ 0, & \text{for } ZL + L + M + R \leq n < 2 \cdot lg \end{cases}$
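The region boundaries of this window follow directly from Table 3. The sketch below stores the six table rows and classifies a sample index into one of the five regions; it deliberately leaves the slope values out. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One row of Table 3. */
typedef struct {
    int lg, zl, l, m, r, zr;
} tcx_window_layout;

typedef enum {
    TCX_ZERO_LEFT,
    TCX_SLOPE_LEFT,
    TCX_FLAT,
    TCX_SLOPE_RIGHT,
    TCX_ZERO_RIGHT
} tcx_region;

/* Looks up the Table 3 row for last_lpd_mode (0..3) and mod (1..3). */
const tcx_window_layout *tcx_layout(int last_lpd_mode, int mod)
{
    static const tcx_window_layout table[2][3] = {
        { {320, 160, 0, 256, 128, 96},
          {576, 288, 0, 512, 128, 224},
          {1152, 512, 128, 1024, 128, 512} },
        { {256, 64, 128, 128, 128, 64},
          {512, 192, 128, 384, 128, 192},
          {1024, 448, 128, 896, 128, 448} }
    };
    assert(last_lpd_mode >= 0 && last_lpd_mode <= 3);
    assert(mod >= 1 && mod <= 3);
    return &table[last_lpd_mode == 0 ? 0 : 1][mod - 1];
}

/* Region of the MDCT window that sample n (0 <= n < 2*lg) falls into. */
tcx_region tcx_window_region(const tcx_window_layout *t, int n)
{
    assert(t != NULL && n >= 0 && n < 2 * t->lg);
    if (n < t->zl) return TCX_ZERO_LEFT;
    if (n < t->zl + t->l) return TCX_SLOPE_LEFT;
    if (n < t->zl + t->l + t->m) return TCX_FLAT;
    if (n < t->zl + t->l + t->m + t->r) return TCX_SLOPE_RIGHT;
    return TCX_ZERO_RIGHT;
}
```

For every row of Table 3, ZL + L + M + R + ZR equals 2·lg, which is a convenient consistency check on the table.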

The quantized spectral coefficients, quant[ ], delivered by the arithmetic decoder are completed by a comfort noise. The level of the injected noise is determined by the decoded noise_factor as follows:

noise_level = 0.0625*(8 - noise_factor)

A noise vector, noise[ ], is then computed using a random function, random_sign( ), delivering randomly the value −1 or +1:

noise[i] = random_sign( )*noise_level;
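A minimal sketch of the noise level and noise vector computation follows. The text does not specify how random_sign( ) is generated, so the C library `rand()` is used here purely as a stand-in; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* noise_level = 0.0625 * (8 - noise_factor) */
double tcx_noise_level(int noise_factor)
{
    return 0.0625 * (double)(8 - noise_factor);
}

/* Stand-in for random_sign( ): returns -1 or +1. The generator is an
 * assumption of this sketch and is not taken from the text. */
static double random_sign(void)
{
    return (rand() & 1) ? 1.0 : -1.0;
}

/* Fills noise[0 .. lg-1] with +/- noise_level. */
void tcx_fill_noise(double *noise, size_t lg, int noise_factor)
{
    double level = tcx_noise_level(noise_factor);

    assert(noise != NULL);
    for (size_t i = 0; i < lg; i++) {
        noise[i] = random_sign() * level;
    }
}
```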

The quant[ ] and noise[ ] vectors are combined to form the reconstructed spectral coefficients vector, r[ ], in a way that the runs of 8 consecutive zeros in quant[ ] are replaced by the components of noise[ ]. A run of 8 non-zeros is detected according to the formula:

$\begin{cases} rl[i] = 1, & \text{for } i \in [0,\ lg/6[ \\ rl[lg/6 + i] = \sum_{k=0}^{7} quant[lg/6 + 8 \cdot \lfloor i/8 \rfloor + k], & \text{for } i \in [0,\ 7 \cdot lg/6[ \end{cases}$

One obtains the reconstructed spectrum as follows:

$r[i] = \begin{cases} quant[i], & \text{if } rl[i] = 1 \\ noise[i], & \text{otherwise} \end{cases}$

Prior to applying the inverse MDCT, a spectrum de-shaping is applied according to the following steps:

1. calculate the energy E_(m) of the 8-dimensional block at index m for each 8-dimensional block of the first quarter of the spectrum
2. compute the ratio R_(m)=sqrt(E_(m)/E_(I)), where I is the block index with the maximum value of all E_(m)
3. if R_(m)<0.1, then set R_(m)=0.1
4. if R_(m)<R_(m−1), then set R_(m)=R_(m−1)

Each 8-dimensional block belonging to the first quarter of the spectrum is then multiplied by the factor R_(m).

The reconstructed spectrum is fed into an inverse MDCT. The non-windowed output signal, x[ ], is re-scaled by the gain, g, obtained by an inverse quantization of the decoded global_gain index:

$g = \frac{10^{global\_gain/28}}{2 \cdot rms}$

Where rms is calculated as:

$rms = \sqrt{\frac{\sum\limits_{i = lg/2}^{3 \cdot lg/2 - 1} x^{2}[i]}{L + M + R}}$

The rescaled synthesized time-domain signal is then equal to:

x_(w)[i] = x[i]·g
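The rescaling can be sketched in Python as follows. The sketch reads the gain formula as g = 10^(global_gain/28) / (2·rms), takes x[ ] to hold 2·lg samples, and adds a guard for a silent frame; the guard and the function name `rescale_tcx` are assumptions made for illustration.

```python
import math


def rescale_tcx(x, global_gain, lg, l, m, r):
    """Rescale the non-windowed inverse MDCT output x[] by the gain g."""
    # rms over the central lg samples, normalized by L + M + R.
    energy = sum(v * v for v in x[lg // 2:3 * lg // 2])
    rms = math.sqrt(energy / (l + m + r))
    if rms == 0.0:
        return list(x)  # assumption: leave a silent frame untouched
    g = 10.0 ** (global_gain / 28.0) / (2.0 * rms)
    return [v * g for v in x]


x = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
print(rescale_tcx(x, global_gain=28, lg=4, l=1, m=2, r=1))
```

In the usage example rms equals 1 and global_gain equals 28, so every sample is multiplied by 10/2 = 5.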

After rescaling, the windowing and overlap-add are applied.

The reconstructed TCX target x(n) is then filtered through the zero-state inverse weighted synthesis filter Â(z)(1−αz⁻¹)/Â(z/λ) to find the excitation signal which will be applied to the synthesis filter. Note that the interpolated LP filter per subframe is used in the filtering. Once the excitation is determined, the signal is reconstructed by filtering the excitation through the synthesis filter 1/Â(z) and then de-emphasizing by filtering through the filter 1/(1−0.68z⁻¹) as described above.
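The last two filtering stages, the synthesis filter 1/Â(z) and the de-emphasis filter 1/(1−0.68z⁻¹), are both all-pole recursions and can be sketched in Python. The sketch is simplified under stated assumptions: it starts from zero filter memory and uses one fixed coefficient set per call, whereas the text uses an LP filter interpolated per subframe. The function names are hypothetical.

```python
def synthesis_filter(excitation, a):
    """Filter through 1/A(z), where a = [1, a1, ..., ap] are the LP coefficients."""
    y = []
    for n, e in enumerate(excitation):
        acc = e
        # y[n] = e[n] - sum_k a[k] * y[n - k], with zero initial state (assumption)
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y


def de_emphasis(x, mu=0.68):
    """Filter through 1/(1 - mu*z^-1): y[n] = x[n] + mu * y[n - 1]."""
    y = []
    prev = 0.0  # zero initial state (assumption)
    for v in x:
        prev = v + mu * prev
        y.append(prev)
    return y


excitation = [1.0, 0.0, 0.0]
signal = de_emphasis(synthesis_filter(excitation, [1.0, -0.5]))
print(signal)
```

A decoder operating on consecutive frames would carry the filter memories from one frame to the next instead of resetting them to zero.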

Note that the excitation is also needed to update the ACELP adaptive codebook and to allow switching from TCX to ACELP in a subsequent frame. Note also that the length of the TCX synthesis is given by the TCX frame length (without the overlap): 256, 512 or 1024 samples for the mod[ ] of 1, 2 or 3, respectively.

Normative References

-   [1] ISO/IEC 11172-3:1993, Information technology—Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s, Part 3: Audio.
-   [2] ITU-T Rec. H.222.0 (1995) | ISO/IEC 13818-1:2000, Information technology—Generic coding of moving pictures and associated audio information:—Part 1: Systems.
-   [3] ISO/IEC 13818-3:1998, Information technology—Generic coding of moving pictures and associated audio information:—Part 3: Audio.
-   [4] ISO/IEC 13818-7:2004, Information technology—Generic coding of moving pictures and associated audio information:—Part 7: Advanced Audio Coding (AAC).
-   [5] ISO/IEC 14496-3:2005, Information technology—Coding of audio-visual objects—Part 1: Systems
-   [6] ISO/IEC 14496-3:2005, Information technology—Coding of audio-visual objects—Part 3: Audio
-   [7] ISO/IEC 23003-1:2007, Information technology—MPEG audio technologies—Part 1: MPEG Surround
-   [8] 3GPP TS 26.290 V6.3.0, Extended Adaptive Multi-Rate—Wideband (AMR-WB+) codec; Transcoding functions
-   [9] 3GPP TS 26.190, Adaptive Multi-Rate—Wideband (AMR-WB) speech codec; Transcoding functions
-   [10] 3GPP TS 26.090, Adaptive Multi-Rate (AMR) speech codec; Transcoding functions

Definitions

Definitions can be found in ISO/IEC 14496-3, subpart 1, subclause 1.3 (Terms and definitions) and in 3GPP TS 26.290, section 3 (Definitions and abbreviations).

Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.

The inventive encoded audio signal can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.

Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed.

Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.

Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.

Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.

In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.

A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein.

A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.

A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.

A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.

In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are advantageously performed by any hardware apparatus.

The above described embodiments are merely illustrative for the principles of the present invention. It is understood that modifications and variations of the arrangements and the details described herein will be apparent to others skilled in the art. It is the intent, therefore, to be limited only by the scope of the impending patent claims and not by the specific details presented by way of description and explanation of the embodiments herein.

While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.

The invention claimed is:
 1. Audio encoder for encoding an audio signal, comprising: a first coding branch for encoding an audio signal using a first coding algorithm to acquire a first encoded signal, the first coding branch comprising the first converter for converting an input signal into a spectral domain; a second coding branch for encoding an audio signal using a second coding algorithm to acquire a second encoded signal, wherein the first coding algorithm is different from the second coding algorithm, the second coding branch comprising a domain converter for converting an input signal from an input domain into an output domain, and a second converter for converting an input signal into a spectral domain; a switch for switching between the first coding branch and the second coding branch so that, for a portion of the audio input signal, either the first encoded signal or the second encoded signal is in an encoder output signal; a signal analyzer for analyzing the portion of the audio signal to determine, whether the portion of the audio signal is represented as the first encoded signal or the second encoded signal in the encoder output signal, wherein the signal analyzer is furthermore configured for variably determining a respective time/frequency resolution of the first converter and the second converter, when the first encoded signal or the second encoded signal representing the portion of the audio signal is generated; and an output interface for generating an encoder output signal comprising the first encoded signal and the second encoded signal and information indicating the first encoded signal and the second encoded signal, and information indicating the time/frequency resolution applied for encoding the first encoded signal and for encoding the second encoded signal.
 2. Audio encoder in accordance with claim 1, in which the signal analyzer is configured for classifying the portion of the audio signal as a speech-like audio signal or a music-like audio signal and for performing a transient detection in case of a music signal for determining the time/frequency resolution of the first converter or for performing an analysis-by-synthesis processing for determining the time/frequency resolution of the second converter.
 3. Audio encoder in accordance with claim 1, in which the first converter and the second converter comprise a variable windowed transform processor comprising a window function with a variable window size and a transform function with a variable transform length, and wherein the signal analyzer is configured for controlling, based on the signal analysis, the window size and/or the transform length.
 4. Audio encoder in accordance with claim 1, in which the second encoder branch comprises a first processing branch for processing an audio signal in the domain determined by the domain converter, and a second processing branch comprising the second converter, wherein the signal analyzer is configured for sub-dividing the portion of the audio signal into a sequence of sub-portions, and wherein the signal analyzer is configured for determining the time/frequency resolution of the second converter depending on the position of the sub-portion processed by the first processing branch with respect to a sub-portion of the portion processed by the second processing branch.
 5. Audio encoder in accordance with claim 4, in which the first processing branch comprises an ACELP encoder, in which the second processing branch comprises an MDCT-TCX processing device, in which the signal analyzer is configured for setting the time resolution of the second converter to a first value determined by a length of a sub-portion or a second value determined by a length of the sub-portion multiplied by an integer value greater than one, wherein the second value is lower than the first value.
 6. Audio encoder in accordance with claim 1, in which the signal analyzer is configured for determining a signal classification in a constant raster covering a plurality of equally sized blocks of audio samples, and for sub-dividing a block into a variable number of blocks depending on the audio signal, wherein a length of the sub-block determines the first time/frequency resolution or the second time/frequency resolution.
 7. Audio encoder in accordance with claim 1, in which the second coding branch comprises: a first processing branch for processing an audio signal; a second processing branch, the second processing branch comprising the second converter; and a further switch for switching between the first processing branch and the second processing branch so that, for a portion of the audio signal input into the second coding branch, either a first processed signal or a second processed signal is in the second encoded signal.
 8. Method of audio encoding an audio signal, comprising: encoding, in a first coding branch, an audio signal using a first coding algorithm to acquire a first encoded signal, the first coding branch comprising the first converter for converting an input signal into a spectral domain; encoding, in a second coding branch, an audio signal using a second coding algorithm to acquire a second encoded signal, wherein the first coding algorithm is different from the second coding algorithm, the second coding branch comprising a domain converter for converting an input signal from an input domain into an output domain, and a second converter for converting an input signal into a spectral domain; switching between the first coding branch and the second coding branch so that, for a portion of the audio input signal, either the first encoded signal or the second encoded signal is in an encoder output signal; analyzing the portion of the audio signal to determine, whether the portion of the audio signal is represented as the first encoded signal or the second encoded signal in the encoder output signal, variably determining a respective time/frequency resolution of the first converter and the second converter, when the first encoded signal or the second encoded signal representing the portion of the audio signal is generated; and generating an encoder output signal comprising the first encoded signal and the second encoded signal and information indicating the first encoded signal and the second encoded signal, and information indicating the time/frequency resolution applied for encoding the first encoded signal and for encoding the second encoded signal.
 9. Audio decoder for decoding an encoded signal, the encoded signal comprising a first encoded signal, a second encoded signal, an indication indicating the first encoded signal and the second encoded signal, and a time/frequency resolution information to be used for decoding the first encoded signal and the second encoded audio signal, comprising: a first decoding branch for decoding the first encoded signal using a first controllable frequency/time converter, the first controllable frequency/time converter being configured for being controlled using the time/frequency resolution information for the first encoded signal to acquire a first decoded signal; a second decoding branch for decoding the second encoded signal using a second controllable frequency/time converter, the second controllable frequency/time converter being configured for being controlled using the time/frequency resolution information for the second encoded signal; a controller for controlling the first frequency/time converter and the second frequency/time converter using the time/frequency resolution information; a domain converter for generating a synthesis signal using the second decoded signal; and a combiner for combining the first decoded signal and the synthesis signal to acquire a decoded audio signal.
 10. Audio decoder in accordance with claim 9, in which the second decoding branch comprises a first inverse processing branch for inverse processing a first processed signal being additionally comprised in the encoded signal to acquire a first inverse processed signal; wherein the second controllable frequency/time converter is located in a second inverse processing branch configured for inverse processing the second encoded signal in a domain identical to the domain of the first inverse processed signal to acquire a second inverse processed signal; a further combiner for combining the first inverse processed signal and the second inverse processed signal to acquire a combined signal; and wherein the combined signal is input into the combiner.
 11. Audio decoder in accordance with claim 9, in which the first frequency/time converter and the second frequency/time converter are time domain aliasing cancellation converters comprising an overlap/add unit for canceling a time-domain aliasing comprised in the first encoded signal and the second encoded signal.
 12. Audio decoder in accordance with claim 9, in which the encoded signal comprises coding mode information identifying, whether an encoded signal is the first encoded signal and the second encoded signal, and wherein the decoder further comprises an input interface for interpreting the coding mode information to determine, whether the encoded signal is to be fed either into the first decoding branch or into the second decoding branch.
 13. Audio decoder in accordance with claim 9, in which the first encoded signal is arithmetically encoded, and wherein the first coding branch comprises an arithmetic decoder.
 14. Audio decoder in accordance with claim 9, in which the first coding branch comprises a dequantizer comprising a non-uniform dequantization characteristic for canceling a result of a non-uniform quantization applied when generating the first encoded signal, wherein the second coding branch comprises a dequantizer using a dequantization characteristic being different from the non-uniform dequantization characteristic, or wherein the second coding branch does not comprise a dequantizer at all.
 15. Audio decoder in accordance with claim 9, in which the controller is configured for controlling the first frequency/time converter and the second frequency/time converter by applying, for each converter, a discrete frequency/time resolution of a number of possible different discrete frequency/time resolutions, the number of possible different frequency/time resolutions being higher for the second converter compared to the number of possible different frequency/time resolutions for the first converter.
 16. Audio decoder in accordance with claim 9, in which the domain converter is an LPC synthesis processor generating the synthesis signal using an LPC filter information, the LPC filter information being comprised in the encoded signal.
 17. Method of audio decoding an encoded signal, the encoded signal comprising a first encoded signal, a second encoded signal, an indication indicating the first encoded signal and the second encoded signal, and a time/frequency resolution information to be used for decoding the first encoded signal and the second encoded audio signal, comprising: decoding, by a first decoding branch, the first encoded signal using a first controllable frequency/time converter, the first controllable frequency/time converter being configured for being controlled using the time/frequency resolution information for the first encoded signal to acquire a first decoded signal; decoding, by a second decoding branch, the second encoded signal using a second controllable frequency/time converter, the second controllable frequency/time converter being configured for being controlled using the time/frequency resolution information for the second encoded signal; controlling the first frequency/time converter and the second frequency/time converter using the time/frequency resolution information; generating, by a domain converter, a synthesis signal using the second decoded signal; and combining the first decoded signal and the synthesis signal to acquire a decoded audio signal.
 18. A non-transitory storage medium having stored thereon a computer program for performing, when running on a processor, a method of audio encoding an audio signal, comprising: encoding, in a first coding branch, an audio signal using a first coding algorithm to acquire a first encoded signal, the first coding branch comprising the first converter for converting an input signal into a spectral domain; encoding, in a second coding branch, an audio signal using a second coding algorithm to acquire a second encoded signal, wherein the first coding algorithm is different from the second coding algorithm, the second coding branch comprising a domain converter for converting an input signal from an input domain into an output domain, and a second converter for converting an input signal into a spectral domain; switching between the first coding branch and the second coding branch so that, for a portion of the audio input signal, either the first encoded signal or the second encoded signal is in an encoder output signal; analyzing the portion of the audio signal to determine, whether the portion of the audio signal is represented as the first encoded signal or the second encoded signal in the encoder output signal, variably determining a respective time/frequency resolution of the first converter and the second converter, when the first encoded signal or the second encoded signal representing the portion of the audio signal is generated; and generating an encoder output signal comprising the first encoded signal and the second encoded signal and information indicating the first encoded signal and the second encoded signal, and information indicating the time/frequency resolution applied for encoding the first encoded signal and for encoding the second encoded signal or the method of audio decoding an encoded signal, the encoded signal comprising a first encoded signal, a second encoded signal, an indication indicating the first encoded signal and the second encoded signal, and a time/frequency resolution information to be used for decoding the first encoded signal and the second encoded audio signal, comprising: decoding, by a first decoding branch, the first encoded signal using a first controllable frequency/time converter, the first controllable frequency/time converter being configured for being controlled using the time/frequency resolution information for the first encoded signal to acquire a first decoded signal; decoding, by a second decoding branch, the second encoded signal using a second controllable frequency/time converter, the second controllable frequency/time converter being configured for being controlled using the time/frequency resolution information for the second encoded signal; controlling the first frequency/time converter and the second frequency/time converter using the time/frequency resolution information; generating, by a domain converter, a synthesis signal using the second decoded signal; and combining the first decoded signal and the synthesis signal to acquire a decoded audio signal.